Martin Belanger [Tue, 8 Feb 2022 19:33:45 +0000 (14:33 -0500)]
nvme: send uevent on connection up
When connectivity with a controller is lost, the driver will keep
trying to reconnect once every 10 sec. When connection is restored,
user-space apps need to be informed so that they can take proper
action. For example, TP8010 introduces the DIM PDU, which is used to
register with a discovery controller (DC). The DIM PDU is sent from
user-space. The DIM PDU must be sent every time a connection is
established with a DC. Therefore, the kernel must tell user-space apps
when connection is restored so that registration can happen.
The uevent sent is a "change" uevent with environmental data
set to: "NVME_EVENT=connected".
Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: John Meneghini <jmeneghi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
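Note on "nvme: send uevent on connection up": a user-space listener could recognize the event roughly as sketched below. This is a minimal sketch, not the code of any existing tool. It assumes the uevent payload has already been read into a buffer as a sequence of NUL-terminated KEY=VALUE strings; the function name uevent_is_nvme_connected() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if a NUL-separated uevent payload contains the exact entry
 * "NVME_EVENT=connected", 0 otherwise. The payload layout (a sequence of
 * NUL-terminated strings) is an assumption of this sketch.
 */
static int uevent_is_nvme_connected(const char *buf, size_t len)
{
    static const char want[] = "NVME_EVENT=connected";
    const size_t want_len = sizeof(want) - 1;
    size_t off = 0;

    assert(buf != NULL);
    while (off < len) {
        /* Find the end of the current entry without reading past len. */
        const char *entry = buf + off;
        const char *nul = memchr(entry, '\0', len - off);
        size_t n = nul ? (size_t)(nul - entry) : len - off;

        if (n == want_len && memcmp(entry, want, n) == 0)
            return 1;
        off += n + 1; /* skip the entry and its terminator */
    }
    return 0;
}
```

An application that registers with a discovery controller would call such a check on every "change" uevent and send the DIM PDU again when it returns 1.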
Kanchan Joshi [Thu, 10 Feb 2022 05:37:55 +0000 (11:07 +0530)]
nvme: add vectored-io support for user-passthrough
Add a new NVME_IOCTL_IO64_CMD_VEC ioctl that works like the existing
NVME_IOCTL_IO64_CMD ioctl except that it takes an array of iovecs
and thus supports vectored I/O.
- cmd.addr is base address of user iovec array
- cmd.vec_cnt is count of iovec array elements
This patch does not include a vectored variant for admin commands, as most
of them are light on buffers and likely to have low invocation frequency.
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
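Note on "nvme: add vectored-io support for user-passthrough": the sketch below only illustrates how the two fields named in the message relate to a user iovec array. struct demo_vec_cmd is a hypothetical stand-in that holds just those two fields; it is not the kernel's passthrough command structure, and no ioctl is issued here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical stand-in holding only the two fields named in the message. */
struct demo_vec_cmd {
    uint64_t addr;    /* base address of the user iovec array */
    uint32_t vec_cnt; /* number of elements in that array */
};

/* Point the command at an iovec array instead of a single flat buffer. */
static void demo_fill_vec(struct demo_vec_cmd *cmd,
                          const struct iovec *iov, uint32_t cnt)
{
    assert(cmd != NULL && iov != NULL && cnt > 0);
    cmd->addr = (uint64_t)(uintptr_t)iov;
    cmd->vec_cnt = cnt;
}

/* Total number of bytes described by the iovec array. */
static size_t demo_total_len(const struct iovec *iov, uint32_t cnt)
{
    size_t total = 0;

    assert(iov != NULL);
    for (uint32_t i = 0; i < cnt; i++)
        total += iov[i].iov_len;
    return total;
}
```

With two buffers of 512 and 4096 bytes, the command would carry a count of 2 and describe 4608 bytes in total.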
Alan Adamson [Thu, 3 Feb 2022 08:11:53 +0000 (00:11 -0800)]
nvme: add verbose error logging
Improve logging of NVMe errors. If NVME_VERBOSE_ERRORS is configured,
a verbose description of the error is logged; otherwise only the status
codes/bits are logged.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
[kch]: fix several nits, cosmetics, and trim down code.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Alan Adamson <alan.adamson@oracle.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chaitanya Kulkarni [Thu, 10 Feb 2022 19:12:36 +0000 (11:12 -0800)]
nvme: add a helper to initialize connect_q
Add and use a helper to remove duplicate code for fabrics connect_q
initialization and error handling across all the transports.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Max Gurtovoy [Wed, 9 Feb 2022 08:54:49 +0000 (10:54 +0200)]
nvme-rdma: add helpers for mapping/unmapping request
Introduce nvme_rdma_dma_map_req/nvme_rdma_dma_unmap_req helper functions
to improve code readability and simplify the error flow.
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Mon, 14 Feb 2022 09:07:32 +0000 (11:07 +0200)]
nvmet-tcp: replace ida_simple_[get|remove] with the simpler ida_[alloc|free]
ida_simple_[get|remove] are wrappers anyways.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Mon, 14 Feb 2022 09:07:31 +0000 (11:07 +0200)]
nvmet-rdma: replace ida_simple_[get|remove] with the simpler ida_[alloc|free]
ida_simple_[get|remove] are wrappers anyways.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Mon, 14 Feb 2022 09:07:30 +0000 (11:07 +0200)]
nvmet-fc: replace ida_simple_[get|remove] with the simpler ida_[alloc|free]
ida_simple_[get|remove] are wrappers anyways.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Mon, 14 Feb 2022 09:07:29 +0000 (11:07 +0200)]
nvmet: replace ida_simple_[get|remove] with the simpler ida_[alloc|free]
ida_simple_[get|remove] are wrappers anyways.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Mon, 14 Feb 2022 09:07:28 +0000 (11:07 +0200)]
nvme-fc: replace ida_simple_[get|remove] with the simpler ida_[alloc|free]
ida_simple_[get|remove] are wrappers anyways.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Mon, 14 Feb 2022 09:07:27 +0000 (11:07 +0200)]
nvme: replace ida_simple_[get|remove] with the simpler ida_[alloc|free]
ida_simple_[get|remove] are wrappers anyways.
Also, use ida_alloc_min with the ns_ida as namespace
enumeration starts with 1.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
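Note on the ida_alloc_min() use for ns_ida: the toy allocator below is a user-space illustration of "hand out the smallest free ID that is not below a minimum", which is why an enumeration that starts with 1 never receives 0. It is not the kernel IDA implementation; toy_ida, toy_ida_alloc_min() and toy_ida_free() are hypothetical names, and the toy returns -1 when it runs out of IDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_IDA_MAX 64 /* arbitrary capacity for this sketch */

struct toy_ida {
    bool used[TOY_IDA_MAX];
};

/* Return the smallest free ID >= min and mark it used, or -1 if none. */
static int toy_ida_alloc_min(struct toy_ida *ida, unsigned int min)
{
    assert(ida != NULL);
    for (unsigned int id = min; id < TOY_IDA_MAX; id++) {
        if (!ida->used[id]) {
            ida->used[id] = true;
            return (int)id;
        }
    }
    return -1;
}

/* Give an ID back so that a later allocation can reuse it. */
static void toy_ida_free(struct toy_ida *ida, unsigned int id)
{
    assert(ida != NULL && id < TOY_IDA_MAX);
    ida->used[id] = false;
}
```

Allocating twice with a minimum of 1 yields 1 and then 2; after freeing 1, the next allocation yields 1 again, and 0 is never handed out.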
Chaitanya Kulkarni [Wed, 2 Feb 2022 09:04:45 +0000 (01:04 -0800)]
nvmet: allow bdev in buffered_io mode
Allow a block device to be configured in buffered I/O mode by using
the file backend. This lets the block device namespace use the cache,
which shows a significant performance improvement.
Update the block device ns enable function to return early when the
buffered_io flag is set.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chaitanya Kulkarni [Wed, 2 Feb 2022 09:04:44 +0000 (01:04 -0800)]
nvmet: use i_size_read() to set size for file-ns
Instead of calling vfs_getattr(), use i_size_read() to read the size of
the file, so that one call can read the size of not only the file type
but also the block type. This is needed to implement buffered_io support
for the NVMeOF block device backend.
Also change the return type of nvmet_file_ns_revalidate() from int to
void, since this function does not return any meaningful value.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chaitanya Kulkarni [Wed, 12 Jan 2022 06:20:58 +0000 (22:20 -0800)]
nvme-fabrics: remove unnecessary braces for case
Braces are not required for enum value NVME_SC_CONNECT_INVALID_PARAM
when used on the switch-case statement, remove the braces.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chaitanya Kulkarni [Wed, 12 Jan 2022 06:20:57 +0000 (22:20 -0800)]
nvme-fabrics: use consistent zeroout pattern
Remove the zeroout memset() call and zero out the local variable cmd at
the time of declaration in nvmf_reg_read32(), similar to what we have
done in nvmf_reg_read64(), nvmf_reg_write32(),
nvmf_connect_admin_queue(), and nvmf_connect_io_queue().
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
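Note on "nvme-fabrics: use consistent zeroout pattern": the two patterns being unified can be sketched as below. struct demo_cmd is a hypothetical stand-in for the real command structure, and the portable "{ 0 }" initializer is used here as an assumption of the sketch rather than as a statement about the exact spelling in the driver.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical command layout, used only for this illustration. */
struct demo_cmd {
    uint8_t opcode;
    uint32_t nsid;
    uint64_t addr;
};

/* Older pattern: declare first, then clear with a separate memset() call. */
static struct demo_cmd demo_cmd_memset(void)
{
    struct demo_cmd cmd;

    memset(&cmd, 0, sizeof(cmd));
    return cmd;
}

/* Consistent pattern: zero the variable at the point of declaration. */
static struct demo_cmd demo_cmd_init(void)
{
    struct demo_cmd cmd = { 0 };

    return cmd;
}

/* Return 1 when every named member is zero. */
static int demo_cmd_is_zero(const struct demo_cmd *cmd)
{
    assert(cmd != NULL);
    return cmd->opcode == 0 && cmd->nsid == 0 && cmd->addr == 0;
}
```

Both helpers produce a command whose named members are all zero; the second form simply removes the extra call and keeps the zeroing next to the declaration.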
Chaitanya Kulkarni [Wed, 12 Jan 2022 06:21:00 +0000 (22:21 -0800)]
nvme-fabrics: use unsigned int type
Loop variable i will never have a negative value, so use
unsigned int type instead of int.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chaitanya Kulkarni [Wed, 12 Jan 2022 06:20:59 +0000 (22:20 -0800)]
nvme-fabrics: use unsigned int type
Loop variable i will never have a negative value, so use
unsigned int type instead of int.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chaitanya Kulkarni [Sat, 22 Jan 2022 05:05:39 +0000 (21:05 -0800)]
nvme-core: remove unnecessary function parameter
In function nvme_execute_rq() we don't use the gendisk parameter at all.
Remove the unused parameter and adjust the calls.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chaitanya Kulkarni [Wed, 19 Jan 2022 07:49:54 +0000 (23:49 -0800)]
nvme-core: remove unnecessary semicolon
It is not a good practice to have a semicolon at the end of the
function definition. Remove it from nvme_pr_type().
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Qinghua Jin [Fri, 7 Jan 2022 02:22:58 +0000 (10:22 +0800)]
nvme-fc: fix a typo
subsytem -> subsystem
Signed-off-by: Qinghua Jin <qhjin.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chaitanya Kulkarni [Tue, 22 Feb 2022 15:28:52 +0000 (07:28 -0800)]
null_blk: null_alloc_page() cleanup
Remove the goto labels and use direct returns: the error unwinding code
only needs to free the t_page variable if the alloc_pages() call fails,
so having two labels for one kfree() can be avoided easily.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220222152852.26043-3-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chaitanya Kulkarni [Tue, 22 Feb 2022 15:28:51 +0000 (07:28 -0800)]
null_blk: remove hardcoded null_alloc_page() param
The only caller of null_alloc_page() is null_insert_page(), which
unconditionally sets its only parameter to GFP_NOIO; that value is
statically hardcoded in null_blk. There is no point in having a
statically hardcoded function parameter.
Remove the unnecessary gfp_flags parameter and adjust the code so that
null_alloc_page() retains its existing behavior with GFP_NOIO.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220222152852.26043-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chaitanya Kulkarni [Wed, 16 Feb 2022 17:29:45 +0000 (09:29 -0800)]
null_blk: remove hardcoded alloc_cmd() parameter
The only caller of alloc_cmd() is null_submit_bio(), which
unconditionally sets the second parameter to true; that value is
statically hardcoded in null_blk. There is no point in having a
statically hardcoded function parameter.
Remove the unnecessary can_wait parameter and adjust the code so that it
retains the existing behavior of waiting when we don't get a valid
nullb_cmd from __alloc_cmd() in alloc_cmd().
The restructured code avoids multiple return statements and multiple
calls to __alloc_cmd(), and results in a fast path call to
prepare_to_wait() due to the removal of the first alloc_cmd() call.
Follow the pattern that we have in bio_alloc() of setting the structure
members in the structure allocation function: pass bio to alloc_cmd()
to initialize the newly allocated cmd->bio member.
Follow the pattern in copy_to_nullb(), where the result of one function
call (null_cache_active()) is used as a parameter to another function
call (null_insert_page()): use the result of alloc_cmd() as the first
parameter to null_handle_cmd() in null_submit_bio(). This allows us to
remove the local variable cmd from the stack in null_submit_bio(), which
is in the fast path.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220216172945.31124-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chaitanya Kulkarni [Tue, 15 Feb 2022 21:33:10 +0000 (13:33 -0800)]
loop: allow user to set the queue depth
Instead of hardcoding the queue depth, allow the user to set the hw queue
depth using a module parameter. Set the default value to 128 to retain
the existing behavior.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20220215213310.7264-5-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chaitanya Kulkarni [Tue, 15 Feb 2022 21:33:09 +0000 (13:33 -0800)]
loop: remove extra variable in lo_req_flush
The local variable file is used to pass it to the vfs_fsync(). We can
get away with using lo->lo_backing_file instead of storing in a local
variable which is not used anywhere else.
No functional change in this patch.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20220215213310.7264-4-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chaitanya Kulkarni [Tue, 15 Feb 2022 21:33:08 +0000 (13:33 -0800)]
loop: remove extra variable in lo_fallocate()
The local variable q is used to pass it to the blk_queue_discard(). We
can get away with using lo->lo_queue instead of storing in a local
variable which is not used anywhere else.
No functional change in this patch.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20220215213310.7264-3-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chaitanya Kulkarni [Tue, 15 Feb 2022 21:33:07 +0000 (13:33 -0800)]
loop: use sysfs_emit() in the sysfs xxx show()
sprintf() does not know the PAGE_SIZE maximum of the temporary buffer
used for outputting sysfs content, so it is possible to overrun the
PAGE_SIZE buffer length.
Use the generic sysfs_emit() function, which knows the size of the
temporary buffer and ensures that no overrun occurs, in the
loop_attr_[offset|sizelimit|autoclear|partscan|dio]_show() callbacks.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20220215213310.7264-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
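Note on "loop: use sysfs_emit() in the sysfs xxx show()": the idea of a formatter that knows the size of its output buffer can be sketched in user space as below. toy_emit() is a hypothetical analogue built on vsnprintf(), not the kernel's sysfs_emit(), and the 4096-byte TOY_PAGE_SIZE is an assumption of the sketch.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

#define TOY_PAGE_SIZE 4096 /* assumed buffer size for this sketch */

/*
 * Format into a buffer that is assumed to be TOY_PAGE_SIZE bytes long and
 * return the number of characters actually stored, excluding the NUL.
 * Output that does not fit is truncated instead of overrunning the buffer.
 */
static int toy_emit(char *buf, const char *fmt, ...)
{
    va_list args;
    int len;

    assert(buf != NULL && fmt != NULL);
    va_start(args, fmt);
    len = vsnprintf(buf, TOY_PAGE_SIZE, fmt, args);
    va_end(args);

    if (len < 0)
        return 0;
    if (len >= TOY_PAGE_SIZE)
        len = TOY_PAGE_SIZE - 1;
    return len;
}
```

A show-style callback would return the value of toy_emit(buf, "%llu\n", value) directly, whereas a plain sprintf() would keep writing past the end of the buffer if the formatted text were ever longer than the page.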
Chaitanya Kulkarni [Tue, 15 Feb 2022 11:59:51 +0000 (03:59 -0800)]
null_blk: fix return value from null_add_dev()
The function nullb_device_power_store() returns -ENOMEM when
null_add_dev() fails. null_add_dev() can fail with a return value
other than -ENOMEM, such as -EINVAL when the Zoned Block Device option
is used, see:
nullb_device_power_store()
 null_add_dev()
  null_init_zoned_dev()
   return -EINVAL;
Returning -ENOMEM on the command line when trying to load the module
creates confusion when plenty of memory is free on the machine.
Instead of hardcoding -ENOMEM, return the value of the null_add_dev()
function.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220215115951.15945-1-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
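Note on "null_blk: fix return value from null_add_dev()": the difference between hardcoding an error code and propagating the callee's code can be sketched as below. All toy_* names and the zone_cfg_valid flag are hypothetical; the sketch mirrors only the call chain quoted in the message, not the driver code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device description for this sketch. */
struct toy_dev {
    bool zoned;
    bool zone_cfg_valid;
};

static int toy_init_zoned_dev(const struct toy_dev *dev)
{
    return dev->zone_cfg_valid ? 0 : -EINVAL;
}

static int toy_add_dev(const struct toy_dev *dev)
{
    assert(dev != NULL);
    if (dev->zoned) {
        int rc = toy_init_zoned_dev(dev);

        if (rc)
            return rc;
    }
    return 0;
}

/* Before: every failure is reported as -ENOMEM. */
static int toy_power_store_old(const struct toy_dev *dev)
{
    if (toy_add_dev(dev))
        return -ENOMEM;
    return 0;
}

/* After: the caller sees the error that actually occurred. */
static int toy_power_store_new(const struct toy_dev *dev)
{
    return toy_add_dev(dev);
}
```

For a zoned device with an invalid configuration, the old variant reports -ENOMEM while the new variant reports -EINVAL.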
Colin Ian King [Tue, 8 Feb 2022 11:46:56 +0000 (11:46 +0000)]
loop: clean up grammar in warning message
The phrase "has still" should be "still has" to clean up the grammar.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20220208114656.61629-1-colin.i.king@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christophe JAILLET [Mon, 7 Feb 2022 20:48:19 +0000 (21:48 +0100)]
block/rnbd: Remove a useless mutex
According to lib/idr.c,
   The IDA handles its own locking.  It is safe to call any of the IDA
   functions without synchronisation in your code.
so the 'ida_lock' mutex can just be removed.
It is here only to protect some ida_simple_get()/ida_simple_remove() calls.
While at it, switch to ida_alloc_XXX()/ida_free() instead of
ida_simple_get()/ida_simple_remove().
The latter is deprecated and more verbose.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/7f9eccd8b1fce1bac45ac9b01a78cf72f54c0a61.1644266862.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Gioh Kim [Fri, 14 Jan 2022 15:58:55 +0000 (16:58 +0100)]
block/rnbd: client device does not care queue/rotational
On the client side, the device is a network device. There is no reason
to set rotational even if the target device on the server is rotational.
Signed-off-by: Gioh Kim <gi-oh.kim@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://lore.kernel.org/r/20220114155855.984144-3-haris.iqbal@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Gioh Kim [Fri, 14 Jan 2022 15:58:54 +0000 (16:58 +0100)]
block/rnbd-clt: fix CHECK:BRACES warning
This patch fixes the "CHECK:BRACES: braces {} should be used on all
arms of this statement" warning from checkpatch.
Signed-off-by: Gioh Kim <gi-oh.kim@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://lore.kernel.org/r/20220114155855.984144-2-haris.iqbal@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 25 Feb 2022 18:14:40 +0000 (19:14 +0100)]
block: default BLOCK_LEGACY_AUTOLOAD to y
As Luis reported, losetup currently doesn't properly create the loop
device without this if the device node already exists because old
scripts created it manually. So default to y for now and remove the
aggressive removal schedule.
Reported-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220225181440.1351591-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Zhang Wensheng [Thu, 17 Feb 2022 06:42:47 +0000 (14:42 +0800)]
block: update io_ticks when io hang
When the in-flight I/Os are slow and no new I/Os are issued, we expect
iostat to show the I/O hang problem. However, after commit
5b18b5a73760 ("block: delete part_round_stats and switch to less
precise counting"), io_ticks and time_in_queue are not updated until
the end of an I/O, so the avgqu-sz and %util columns of iostat are zero.
time_in_queue is expressed by accumulating stat.nsecs, which is not
suitable to change, and %util expresses the status better when an I/O
hang occurs. To fix io_ticks, use update_io_ticks() and the in-flight
count to update io_ticks when diskstats_show() and part_stat_show()
are called.
Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
Signed-off-by: Zhang Wensheng <zhangwensheng5@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220217064247.4041435-1-zhangwensheng5@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Sat, 29 Jan 2022 01:59:24 +0000 (09:59 +0800)]
block, bfq: don't move oom_bfqq
Our test reports a UAF:
[ 2073.019181] ==================================================================
[ 2073.019188] BUG: KASAN: use-after-free in __bfq_put_async_bfqq+0xa0/0x168
[ 2073.019191] Write of size 8 at addr ffff8000ccf64128 by task rmmod/72584
[ 2073.019192]
[ 2073.019196] CPU: 0 PID: 72584 Comm: rmmod Kdump: loaded Not tainted 4.19.90-yk #5
[ 2073.019198] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 2073.019200] Call trace:
[ 2073.019203] dump_backtrace+0x0/0x310
[ 2073.019206] show_stack+0x28/0x38
[ 2073.019210] dump_stack+0xec/0x15c
[ 2073.019216] print_address_description+0x68/0x2d0
[ 2073.019220] kasan_report+0x238/0x2f0
[ 2073.019224] __asan_store8+0x88/0xb0
[ 2073.019229] __bfq_put_async_bfqq+0xa0/0x168
[ 2073.019233] bfq_put_async_queues+0xbc/0x208
[ 2073.019236] bfq_pd_offline+0x178/0x238
[ 2073.019240] blkcg_deactivate_policy+0x1f0/0x420
[ 2073.019244] bfq_exit_queue+0x128/0x178
[ 2073.019249] blk_mq_exit_sched+0x12c/0x160
[ 2073.019252] elevator_exit+0xc8/0xd0
[ 2073.019256] blk_exit_queue+0x50/0x88
[ 2073.019259] blk_cleanup_queue+0x228/0x3d8
[ 2073.019267] null_del_dev+0xfc/0x1e0 [null_blk]
[ 2073.019274] null_exit+0x90/0x114 [null_blk]
[ 2073.019278] __arm64_sys_delete_module+0x358/0x5a0
[ 2073.019282] el0_svc_common+0xc8/0x320
[ 2073.019287] el0_svc_handler+0xf8/0x160
[ 2073.019290] el0_svc+0x10/0x218
[ 2073.019291]
[ 2073.019294] Allocated by task 14163:
[ 2073.019301] kasan_kmalloc+0xe0/0x190
[ 2073.019305] kmem_cache_alloc_node_trace+0x1cc/0x418
[ 2073.019308] bfq_pd_alloc+0x54/0x118
[ 2073.019313] blkcg_activate_policy+0x250/0x460
[ 2073.019317] bfq_create_group_hierarchy+0x38/0x110
[ 2073.019321] bfq_init_queue+0x6d0/0x948
[ 2073.019325] blk_mq_init_sched+0x1d8/0x390
[ 2073.019330] elevator_switch_mq+0x88/0x170
[ 2073.019334] elevator_switch+0x140/0x270
[ 2073.019338] elv_iosched_store+0x1a4/0x2a0
[ 2073.019342] queue_attr_store+0x90/0xe0
[ 2073.019348] sysfs_kf_write+0xa8/0xe8
[ 2073.019351] kernfs_fop_write+0x1f8/0x378
[ 2073.019359] __vfs_write+0xe0/0x360
[ 2073.019363] vfs_write+0xf0/0x270
[ 2073.019367] ksys_write+0xdc/0x1b8
[ 2073.019371] __arm64_sys_write+0x50/0x60
[ 2073.019375] el0_svc_common+0xc8/0x320
[ 2073.019380] el0_svc_handler+0xf8/0x160
[ 2073.019383] el0_svc+0x10/0x218
[ 2073.019385]
[ 2073.019387] Freed by task 72584:
[ 2073.019391] __kasan_slab_free+0x120/0x228
[ 2073.019394] kasan_slab_free+0x10/0x18
[ 2073.019397] kfree+0x94/0x368
[ 2073.019400] bfqg_put+0x64/0xb0
[ 2073.019404] bfqg_and_blkg_put+0x90/0xb0
[ 2073.019408] bfq_put_queue+0x220/0x228
[ 2073.019413] __bfq_put_async_bfqq+0x98/0x168
[ 2073.019416] bfq_put_async_queues+0xbc/0x208
[ 2073.019420] bfq_pd_offline+0x178/0x238
[ 2073.019424] blkcg_deactivate_policy+0x1f0/0x420
[ 2073.019429] bfq_exit_queue+0x128/0x178
[ 2073.019433] blk_mq_exit_sched+0x12c/0x160
[ 2073.019437] elevator_exit+0xc8/0xd0
[ 2073.019440] blk_exit_queue+0x50/0x88
[ 2073.019443] blk_cleanup_queue+0x228/0x3d8
[ 2073.019451] null_del_dev+0xfc/0x1e0 [null_blk]
[ 2073.019459] null_exit+0x90/0x114 [null_blk]
[ 2073.019462] __arm64_sys_delete_module+0x358/0x5a0
[ 2073.019467] el0_svc_common+0xc8/0x320
[ 2073.019471] el0_svc_handler+0xf8/0x160
[ 2073.019474] el0_svc+0x10/0x218
[ 2073.019475]
[ 2073.019479] The buggy address belongs to the object at ffff8000ccf63f00
 which belongs to the cache kmalloc-1024 of size 1024
[ 2073.019484] The buggy address is located 552 bytes inside of
 1024-byte region [ffff8000ccf63f00, ffff8000ccf64300)
[ 2073.019486] The buggy address belongs to the page:
[ 2073.019492] page:ffff7e000333d800 count:1 mapcount:0 mapping:ffff8000c0003a00 index:0x0 compound_mapcount: 0
[ 2073.020123] flags: 0x7ffff0000008100(slab|head)
[ 2073.020403] raw: 07ffff0000008100 ffff7e0003334c08 ffff7e00001f5a08 ffff8000c0003a00
[ 2073.020409] raw: 0000000000000000 00000000001c001c 00000001ffffffff 0000000000000000
[ 2073.020411] page dumped because: kasan: bad access detected
[ 2073.020412]
[ 2073.020414] Memory state around the buggy address:
[ 2073.020420]  ffff8000ccf64000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2073.020424]  ffff8000ccf64080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2073.020428] >ffff8000ccf64100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2073.020430]                                   ^
[ 2073.020434]  ffff8000ccf64180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2073.020438]  ffff8000ccf64200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2073.020439] ==================================================================
The same problem exists in mainline as well.
This is because oom_bfqq is moved to a non-root group, thus root_group
is freed earlier.
Fix the problem by not moving oom_bfqq.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220129015924.3958918-4-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Sat, 29 Jan 2022 01:59:23 +0000 (09:59 +0800)]
block, bfq: avoid moving bfqq to its parent bfqg
Moving bfqq to its parent bfqg is pointless.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20220129015924.3958918-3-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Sat, 29 Jan 2022 01:59:22 +0000 (09:59 +0800)]
block, bfq: cleanup bfq_bfqq_to_bfqg()
Use bfq_group() instead, which does the same thing.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220129015924.3958918-2-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yahu Gao [Fri, 7 Jan 2022 06:58:59 +0000 (14:58 +0800)]
block/bfq_wf2q: correct weight to ioprio
The return value is ioprio * BFQ_WEIGHT_CONVERSION_COEFF or 0.
What we want is ioprio or 0.
Correct this by changing the calculation.
Signed-off-by: Yahu Gao <gaoyahu19@gmail.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220107065859.25689-1-gaoyahu19@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
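Note on "block/bfq_wf2q: correct weight to ioprio": the arithmetic behind the message can be worked through with the sketch below. It assumes the forward conversion is weight = (levels - ioprio) * coefficient, with 8 priority levels and a coefficient of 10 chosen for the illustration; the demo_* names are hypothetical and the two inverse functions are a reading of the message, not a copy of the patch.

```c
#include <assert.h>

#define DEMO_NR_LEVELS 8 /* assumed number of ioprio levels */
#define DEMO_COEFF 10    /* assumed weight conversion coefficient */

/* Assumed forward conversion from ioprio to weight. */
static int demo_ioprio_to_weight(int ioprio)
{
    assert(ioprio >= 0 && ioprio < DEMO_NR_LEVELS);
    return (DEMO_NR_LEVELS - ioprio) * DEMO_COEFF;
}

/* Described old behavior: the result is ioprio * coefficient, or 0. */
static int demo_weight_to_ioprio_old(int weight)
{
    int v = DEMO_NR_LEVELS * DEMO_COEFF - weight;

    return v > 0 ? v : 0;
}

/* Described intent: the result is ioprio, or 0. */
static int demo_weight_to_ioprio_new(int weight)
{
    int v = DEMO_NR_LEVELS - weight / DEMO_COEFF;

    return v > 0 ? v : 0;
}
```

Under these assumptions, ioprio 4 maps to weight 40. The old inverse gives 80 - 40 = 40, which is ioprio times the coefficient, while the corrected inverse gives 8 - 40 / 10 = 4.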
David Jeffery [Mon, 31 Jan 2022 20:33:37 +0000 (15:33 -0500)]
blk-mq: avoid extending delays of active hctx from blk_mq_delay_run_hw_queues
When blk_mq_delay_run_hw_queues sets an hctx to run in the future, it can
reset the delay length for an already pending delayed work run_work. This
creates a scenario where multiple hctx may have their queues set to run,
but if one runs first and finds nothing to do, it can reset the delay of
another hctx and stall the other hctx's ability to run requests.
To avoid this I/O stall when an hctx's run_work is already pending,
leave it untouched to run at its current designated time rather than
extending its delay. The work will still run, which keeps closed the race
that calling blk_mq_delay_run_hw_queues is needed for, while also avoiding
the I/O stall.
Signed-off-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220131203337.GA17666@redhat
Signed-off-by: Jens Axboe <axboe@kernel.dk>
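Note on "blk-mq: avoid extending delays of active hctx from blk_mq_delay_run_hw_queues": the toy model below contrasts re-arming an already pending delayed work item with leaving it untouched. struct toy_work and both functions are hypothetical; time is a plain counter and nothing here uses the kernel workqueue API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical delayed work item: pending flag plus expiry time. */
struct toy_work {
    bool pending;
    unsigned long expires;
};

/* Described old behavior: always (re)arm, pushing back a pending item. */
static void toy_delay_run_old(struct toy_work *w, unsigned long now,
                              unsigned long delay)
{
    assert(w != NULL);
    w->pending = true;
    w->expires = now + delay;
}

/* Described new behavior: a pending item keeps its designated time. */
static void toy_delay_run_new(struct toy_work *w, unsigned long now,
                              unsigned long delay)
{
    assert(w != NULL);
    if (w->pending)
        return;
    w->pending = true;
    w->expires = now + delay;
}
```

If work is armed at time 0 with a delay of 3 and the function is called again at time 2, the old variant moves the expiry from 3 to 5, while the new variant leaves it at 3, so repeated calls can no longer keep postponing the run.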
Christoph Hellwig [Tue, 15 Feb 2022 09:45:14 +0000 (10:45 +0100)]
virtio_blk: simplify refcounting
Implement the ->free_disk method to free the virtio_blk structure only
once the last gendisk reference goes away instead of keeping a local
refcount.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20220215094514.3828912-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 15 Feb 2022 09:45:13 +0000 (10:45 +0100)]
memstick/mspro_block: simplify refcounting
Implement the ->free_disk method to free the mspro_block_data structure
only once the last gendisk reference goes away instead of keeping a local
refcount.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220215094514.3828912-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 15 Feb 2022 09:45:12 +0000 (10:45 +0100)]
memstick/mspro_block: fix handling of read-only devices
Use set_disk_ro to propagate the read-only state to the block layer
instead of checking for it in ->open and leaking a reference in case
of a read-only device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220215094514.3828912-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 15 Feb 2022 09:45:11 +0000 (10:45 +0100)]
memstick/ms_block: simplify refcounting
Implement the ->free_disk method to free the msb_data structure only once
the last gendisk reference goes away instead of keeping a local refcount.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220215094514.3828912-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 15 Feb 2022 09:45:10 +0000 (10:45 +0100)]
block: add a ->free_disk method
Add a method to notify the driver that the gendisk is about to be freed.
This allows drivers to tie the lifetime of their private data to that of
the gendisk and thus deal with device removal races without expensive
synchronization and boilerplate code.
A new flag is added so that ->free_disk is only called after a successful
call to add_disk, which significantly simplifies the error handling path
during probing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220215094514.3828912-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
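Note on "block: add a ->free_disk method": the lifetime rule described in the message can be modelled in user space as below. struct toy_disk, its added flag, and the toy_* functions are hypothetical; the sketch only shows a release callback that runs when the last reference is dropped, and only if the disk was successfully added.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical disk object with a driver-supplied release callback. */
struct toy_disk {
    int refs;
    bool added; /* set once the "add" step succeeded */
    void *private_data;
    void (*free_disk)(struct toy_disk *disk);
};

static int toy_free_calls; /* counts callback invocations */

/* Driver callback: release the private data tied to the disk. */
static void toy_free_private(struct toy_disk *disk)
{
    free(disk->private_data);
    disk->private_data = NULL;
    toy_free_calls++;
}

/* Drop one reference; the callback runs only on the last put. */
static void toy_put_disk(struct toy_disk *disk)
{
    assert(disk != NULL && disk->refs > 0);
    if (--disk->refs > 0)
        return;
    if (disk->added && disk->free_disk)
        disk->free_disk(disk);
}
```

The driver no longer needs its own reference count: its private data lives exactly as long as the disk object, and a probe path that fails before the add step frees its data itself because the callback is skipped.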
Ming Lei [Wed, 16 Feb 2022 04:45:14 +0000 (12:45 +0800)]
block: revert 4f1e9630afe6 ("blk-throtl: optimize IOPS throttle for large IO scenarios")
Revert commit 4f1e9630afe6 ("blk-throtl: optimize IOPS throttle for large
IO scenarios") since we have another easier way to address this issue and
get better iops throttling result.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-9-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 16 Feb 2022 04:45:13 +0000 (12:45 +0800)]
block: don't try to throttle split bio if iops limit isn't set
We need to throttle a split bio when an IOPS limit is set, even though
the split bio has been marked as BIO_THROTTLED, since the block layer
accounts each split bio.
If only a throughput limit is set up, there is no need to throttle again
when BIO_THROTTLED is set, since the bytes of the whole bio have already
been accounted and considered.
Add a THROTL_TG_HAS_IOPS_LIMIT flag for this purpose.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-8-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
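Note on "block: don't try to throttle split bio if iops limit isn't set": the decision described in the message reduces to a small predicate, sketched below. The toy_* types and flag values are hypothetical stand-ins for the bio and throttle group flags named in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BIO_THROTTLED     (1u << 0) /* bio was already charged once */
#define TOY_TG_HAS_IOPS_LIMIT (1u << 0) /* group has an IOPS limit set */

struct toy_bio {
    unsigned int flags;
};

struct toy_tg {
    unsigned int flags;
};

/*
 * Return true if the bio still has to go through throttling.
 * An already charged bio is skipped only when no IOPS limit is set,
 * because its bytes were accounted when the whole bio was charged.
 */
static bool toy_needs_throttle(const struct toy_bio *bio,
                               const struct toy_tg *tg)
{
    bool charged;
    bool iops;

    assert(bio != NULL && tg != NULL);
    charged = (bio->flags & TOY_BIO_THROTTLED) != 0;
    iops = (tg->flags & TOY_TG_HAS_IOPS_LIMIT) != 0;
    return !charged || iops;
}
```

A split bio that carries the throttled flag is therefore checked again only for groups with an IOPS limit, where each split counts as a separate I/O.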
Ming Lei [Wed, 16 Feb 2022 04:45:12 +0000 (12:45 +0800)]
block: throttle split bio in case of iops limit
Commit 111be8839817 ("block-throttle: avoid double charge") marks a bio as
BIO_THROTTLED unconditionally if __blk_throtl_bio() is called on it, so
the bio will not be passed to __blk_throtl_bio() again. This avoids a
double charge in case of bio splitting. That is reasonable for the
read/write throughput limit, but not for the IOPS limit, because the
block layer provides I/O accounting against split bios.
Chunguang Xu has already observed this issue and fixed it in commit
4f1e9630afe6 ("blk-throtl: optimize IOPS throttle for large IO scenarios").
However, that patch only covers bio splitting in __blk_queue_split(), and
there are other kinds of bio splitting, such as bio_split() &
submit_bio_noacct().
This patch tries to fix the issue in one generic way by always charging
the bio for the iops limit in blk_throtl_bio(). This is reasonable: a
re-submitted or fast-cloned bio is charged if it is submitted to the same
disk/queue, and BIO_THROTTLED will be cleared if bio->bi_bdev is changed.
This new approach gives a much smoother and more stable iops limit than
commit 4f1e9630afe6 ("blk-throtl: optimize IOPS throttle for large IO
scenarios"), since that commit can't actually throttle the current split
bios.
Also, this does not cause a new double iops charge in
blk_throtl_dispatch_work_fn(), in which blk_throtl_bio() is no longer
called.
Reported-by: Ning Li <lining2020x@163.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Chunguang Xu <brookxu@tencent.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 16 Feb 2022 04:45:11 +0000 (12:45 +0800)]
block: merge submit_bio_checks() into submit_bio_noacct
Now submit_bio_checks() is only called by submit_bio_noacct(), so merge
it into submit_bio_noacct().
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 16 Feb 2022 04:45:10 +0000 (12:45 +0800)]
block: don't check bio in blk_throtl_dispatch_work_fn
The bio has been checked already before throttling, so no need to check
it again before dispatching it from throttle queue.
Add a helper of submit_bio_noacct_nocheck() for this purpose.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220216044514.2903784-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 16 Feb 2022 04:45:09 +0000 (12:45 +0800)]
block: don't declare submit_bio_checks in local header
submit_bio_checks() won't be called outside of block/blk-core.c any more
since commit 9d497e2941c3 ("block: don't protect submit_bio_checks by
q_usage_counter"), so mark it as a local helper.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 16 Feb 2022 04:45:08 +0000 (12:45 +0800)]
block: move blk_crypto_bio_prep() out of blk-mq.c
blk_crypto_bio_prep() is called for both bio based and blk-mq drivers,
so move it out of blk-mq.c, then we can unify this kind of handling.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 16 Feb 2022 04:45:07 +0000 (12:45 +0800)]
block: move submit_bio_checks() into submit_bio_noacct
It is cleaner and more readable to check the bio when starting to submit
it, instead of just before calling ->submit_bio() or blk_mq_submit_bio().
It also gives us a chance to optimize bio submission without checking the
bio.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 15 Feb 2022 10:05:40 +0000 (11:05 +0100)]
dm: remove dm_dispatch_clone_request
Fold dm_dispatch_clone_request into its only caller, and use a single
switch statement to dispatch the handling of the different return
values from blk_insert_cloned_request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220215100540.3892965-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 15 Feb 2022 10:05:39 +0000 (11:05 +0100)]
dm: remove useless code from dm_dispatch_clone_request
Both ->start_time_ns and the RQF_IO_STAT are set when the request is
allocated using blk_mq_alloc_request by dm-mpath in blk_mq_rq_ctx_init.
The block layer also ensures ->start_time_ns is only set when actually
needed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220215100540.3892965-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 15 Feb 2022 10:05:38 +0000 (11:05 +0100)]
blk-mq: remove the request_queue argument to blk_insert_cloned_request
The request must be submitted to the queue it was allocated for, so
remove the extra request_queue argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220215100540.3892965-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 15 Feb 2022 10:05:37 +0000 (11:05 +0100)]
blk-mq: fold blk_cloned_rq_check_limits into blk_insert_cloned_request
Fold blk_cloned_rq_check_limits into its only caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220215100540.3892965-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 15 Feb 2022 10:05:36 +0000 (11:05 +0100)]
blk-mq: make the blk-mq stacking code optional
The code to stack blk-mq drivers is only used by dm-multipath, and
will preferably stay that way. Make it optional and only selected
by device mapper, so that the buildbots more easily catch abuses
like the one that slipped into the ufs driver in the last merge
window. Another positive side effect is that kernel builds without
device mapper shrink a little bit as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220215100540.3892965-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chengming Zhou [Sun, 13 Feb 2022 08:59:02 +0000 (16:59 +0800)]
blk-cgroup: set blkg iostat after percpu stat aggregation
There is no need to call blkg_iostat_set for the top blkg iostat on each
CPU, so move it after the percpu stat aggregation.
Fixes: ef45fe470e1e ("blk-cgroup: show global disk stats in root cgroup io.stat")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220213085902.88884-1-zhouchengming@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chaitanya Kulkarni [Tue, 15 Feb 2022 11:52:47 +0000 (03:52 -0800)]
blk-lib: don't check bdev_get_queue() NULL check
According to the comment in bdev_get_queue(), bdev->bd_queue can never
be NULL. Remove the NULL check for the local variable q that is set from
bdev_get_queue() for discard, write_same, and write_zeroes.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220215115247.11717-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 15 Feb 2022 08:10:47 +0000 (09:10 +0100)]
block: remove biodoc.rst
This document is completely out of date and extremely misleading. In
general the existing kerneldoc comments serve as much better
documentation of the still existing functionality, while the history
blurbs are pretty much irrelevant today.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220215081047.3693582-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Barry Song [Mon, 7 Feb 2022 07:49:31 +0000 (15:49 +0800)]
docs: block: biodoc.rst: Drop the obsolete and incorrect content
Since commit 7eaceaccab5f ("block: remove per-queue plugging"), the kernel
has removed blk_run_address_space(), blk_unplug() and sync_buffer(),
and moved to on-stack plugging. The document has been obsolete for
years.
Given that there are no obvious counterparts in the new mechanism to
replace the old APIs, this patch drops the content directly.
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Link: https://lore.kernel.org/r/20220207074931.20067-1-song.bao.hua@hisilicon.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 11 Feb 2022 10:11:49 +0000 (18:11 +0800)]
block: partition include/linux/blk-cgroup.h
Partition include/linux/blk-cgroup.h into two parts: one is public part,
the other is block layer private part.
Suggested by Christoph Hellwig.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220211101149.2368042-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 11 Feb 2022 10:11:48 +0000 (18:11 +0800)]
block: move initialization of q->blkg_list into blkcg_init_queue
q->blkg_list is only used by blkcg code, so move it into
blkcg_init_queue.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220211101149.2368042-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 11 Feb 2022 10:11:47 +0000 (18:11 +0800)]
block: remove THROTL_IOPS_MAX
No one uses THROTL_IOPS_MAX any more, so remove it.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220211101149.2368042-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yang Shi [Thu, 10 Feb 2022 22:52:22 +0000 (14:52 -0800)]
block: introduce block_rq_error tracepoint
Currently, rasdaemon uses the existing tracepoint block_rq_complete
and filters out non-error cases in order to capture block disk errors.
But there are a few problems with this approach:
1. Even though the kernel trace filter can do the filtering work, there is
still some overhead once this tracepoint is enabled.
2. The filter is based merely on errno, which does not align with the kernel
logic that checks errors for print_req_error().
3. block_rq_complete only provides the dev major and minor numbers to
identify the block device, which is not convenient to use in user-space.
So introduce a new tracepoint, block_rq_error, just for the error case.
With this patch, rasdaemon could switch to block_rq_error.
Since the new tracepoint has an implementation similar to
block_rq_complete, move the existing code from TRACE_EVENT
block_rq_complete() into a new event class, block_rq_completion(). Then add
events for block_rq_complete and block_rq_error respectively from the newly
created event class, per the suggestion from Chaitanya Kulkarni.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220210225222.260069-1-shy828301@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
John Garry [Tue, 8 Feb 2022 12:07:04 +0000 (20:07 +0800)]
sbitmap: Delete old sbitmap_queue_get_shallow()
Since sbitmap_queue_get_shallow() was introduced in commit c05e66733788
("sbitmap: add sbitmap_get_shallow() operation"), it has not been used.
Delete sbitmap_queue_get_shallow() and rename public
__sbitmap_queue_get_shallow() -> sbitmap_queue_get_shallow() as it is odd
to have public __foo but no foo at all.
Signed-off-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1644322024-105340-1-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 10 Jan 2022 07:29:45 +0000 (15:29 +0800)]
lib/sbitmap: kill 'depth' from sbitmap_word
Only the last sbitmap_word can have a different depth; all the others
must have the same depth of 1U << sb->shift. So it is not necessary to
store the depth in sbitmap_word: it can be retrieved easily and
efficiently by adding an internal helper, __map_depth(sb, index).
Remove the 'depth' field from sbitmap_word; the
____cacheline_aligned_in_smp annotation for 'word' is then no longer
needed.
No performance effect was seen when running a highly parallel IOPS test
on null_blk.
This saves one cacheline (usually 64 bytes) per sbitmap_word.
Cc: Martin Wilck <martin.wilck@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/20220110072945.347535-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
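As an illustration of the sbitmap change above, the per-word depth can be
derived from the bitmap's total depth and shift instead of being stored in
every word. This is only a minimal userspace sketch: struct sbitmap_sketch
and map_depth() are hypothetical stand-ins, not the kernel's definitions,
and the field names are assumed for the example.

```c
#include <assert.h>

/* Hypothetical, trimmed-down stand-in for struct sbitmap. */
struct sbitmap_sketch {
	unsigned int depth;   /* total number of bits in the bitmap */
	unsigned int shift;   /* log2(bits per word) */
	unsigned int map_nr;  /* number of words in use */
};

/* Depth of word 'index', computed on demand rather than stored per word. */
static unsigned int map_depth(const struct sbitmap_sketch *sb,
			      unsigned int index)
{
	assert(index < sb->map_nr);
	/* Only the last word may hold fewer than 1U << shift bits. */
	if (index == sb->map_nr - 1)
		return sb->depth - (index << sb->shift);
	return 1U << sb->shift;
}
```

With depth = 150 and shift = 6, the first two words each cover 64 bits and
the last word covers the remaining 22.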
Christoph Hellwig [Wed, 2 Feb 2022 16:01:09 +0000 (17:01 +0100)]
block: pass a block_device to bio_clone_fast
Pass a block_device to bio_clone_fast and __bio_clone_fast and give
the functions more suitable names.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 2 Feb 2022 16:01:08 +0000 (17:01 +0100)]
block: initialize the target bio in __bio_clone_fast
All callers of __bio_clone_fast initialize the bio first. Move that
initialization into __bio_clone_fast instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 2 Feb 2022 16:01:07 +0000 (17:01 +0100)]
dm: use bio_clone_fast in alloc_io/alloc_tio
Replace open coded bio_clone_fast implementations with the actual helper.
Note that the bio allocated as part of the dm_io structure in alloc_io
will only actually be used later in alloc_tio, making this earlier
cloning of the information safe.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 2 Feb 2022 16:01:06 +0000 (17:01 +0100)]
block: clone crypto and integrity data in __bio_clone_fast
__bio_clone_fast should also clone integrity and crypto data, as a clone
without those is incomplete. Right now the only caller that can actually
support crypto and integrity data (dm) does it manually for the one
callchain that supports these, but we better do it properly in the core.
Note that all callers except for the above mentioned one also don't need
to handle failure at all, given that the integrity and crypto clones are
based on mempool allocations that won't fail for sleeping allocations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 2 Feb 2022 16:01:05 +0000 (17:01 +0100)]
dm-cache: remove __remap_to_origin_clear_discard
Fold __remap_to_origin_clear_discard into the two callers to prepare
for bio cloning refactoring.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 2 Feb 2022 16:01:04 +0000 (17:01 +0100)]
dm: simplify the single bio fast path in __send_duplicate_bios
Most targets just need a single flush bio. Open code that case in
__send_duplicate_bios without the need to add the bio to a list.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 2 Feb 2022 16:01:03 +0000 (17:01 +0100)]
dm: return the clone bio from alloc_tio
Return the clone bio embedded into the tio as that is what the callers
actually want. Similar for the free side.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 2 Feb 2022 16:01:02 +0000 (17:01 +0100)]
dm: pass the bio instead of tio to __map_bio
This simplifies the callers a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 2 Feb 2022 16:01:01 +0000 (17:01 +0100)]
dm: move cloning the bio into alloc_tio
Move the call to __bio_clone_fast and the assignment of ->len_ptr from
the callers into alloc_tio to prepare for changes to the bio clone API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 2 Feb 2022 16:01:00 +0000 (17:01 +0100)]
dm: fold __send_duplicate_bios into __clone_and_map_simple_bio
Fold __send_duplicate_bios into its only caller to prepare for
refactoring.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 2 Feb 2022 16:00:59 +0000 (17:00 +0100)]
dm: fold clone_bio into __clone_and_map_data_bio
Fold clone_bio into its only caller to prepare for refactoring.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 2 Feb 2022 16:00:58 +0000 (17:00 +0100)]
dm: add a clone_to_tio helper
Add a helper to stop open coding the container_of operations to get
from the clone bio to the tio structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
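The clone_to_tio helper described above wraps a container_of lookup. The
following is a minimal userspace sketch of that pattern, not the dm code
itself: struct bio_sketch, struct tio_sketch, their fields, and
clone_to_tio_sketch() are hypothetical, and container_of is redefined
locally because the kernel macro is not available outside the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical minimal bio. */
struct bio_sketch {
	int opf;
};

/* Hypothetical per-target io structure with the clone bio embedded. */
struct tio_sketch {
	unsigned int target_bio_nr;
	struct bio_sketch clone;
};

/* Map an embedded clone bio back to the structure that contains it. */
static struct tio_sketch *clone_to_tio_sketch(struct bio_sketch *clone)
{
	assert(clone != NULL);
	return container_of(clone, struct tio_sketch, clone);
}
```

Keeping the pointer arithmetic in one helper means callers never spell out
the member name or the containing type themselves.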
Christoph Hellwig [Wed, 2 Feb 2022 16:00:57 +0000 (17:00 +0100)]
drbd: set ->bi_bdev in drbd_req_new
Make sure the newly allocated bio has the correct bi_bdev set from the
start.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 4 Feb 2022 07:19:34 +0000 (08:19 +0100)]
block: call bio_associate_blkg from bio_reset
Call bio_associate_blkg just like bio_set_dev did in the callers before
the conversion to set the block device in bio_reset.
Fixes: a7c50c940477 ("block: pass a block_device and opf to bio_reset")
Reported-by: syzbot+2b3f18414c37b42dcc94@syzkaller.appspotmail.com
Tested-by: syzbot+2b3f18414c37b42dcc94@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220204071934.168469-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Song Liu [Thu, 3 Feb 2022 19:28:27 +0000 (11:28 -0800)]
scsi: use BLK_STS_OFFLINE for not fully online devices
The new error message for such cases looks like
[ 172.809565] device offline error, dev sda, sector 3138208 ...
which will not be confused with a regular I/O error (BLK_STS_IOERR).
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220203192827.1370270-4-song@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Song Liu [Thu, 3 Feb 2022 19:28:26 +0000 (11:28 -0800)]
block: return -ENODEV for BLK_STS_OFFLINE
Change the user visible return value for BLK_STS_OFFLINE to -ENODEV, which
is more descriptive than existing -EIO.
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220203192827.1370270-3-song@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
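To illustrate the status-to-errno mapping that the commit above changes,
here is a minimal userspace sketch of a lookup table in which the offline
status maps to -ENODEV while the generic I/O error stays at -EIO. The enum,
table, and function names are hypothetical and only mirror the idea; they
are not the block layer's definitions.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical subset of block status codes. */
enum sts_sketch { STS_OK = 0, STS_IOERR, STS_OFFLINE, STS_NR };

/* Each status carries the errno returned to users and a log prefix. */
static const struct {
	int err;
	const char *name;
} sts_table[STS_NR] = {
	[STS_OK]      = { 0,       "" },
	[STS_IOERR]   = { -EIO,    "I/O" },
	[STS_OFFLINE] = { -ENODEV, "device offline" },
};

/* Translate a status code into the user-visible errno. */
static int sts_to_errno_sketch(enum sts_sketch sts)
{
	assert((unsigned int)sts < STS_NR);
	return sts_table[sts].err;
}
```

A caller can then tell a device that is going away (-ENODEV) apart from a
genuine media or transport failure (-EIO).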
Song Liu [Thu, 3 Feb 2022 19:28:25 +0000 (11:28 -0800)]
block: introduce BLK_STS_OFFLINE
Currently, drivers report BLK_STS_IOERR for devices that are not fully
online or are being removed. This behavior could cause confusion for users,
as these are not really I/O errors from the device.
Solve this issue with a new state, BLK_STS_OFFLINE, which reports "device
offline error" in dmesg instead of "I/O error".
EIO is intentionally kept so as not to change the user visible return value.
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220203192827.1370270-2-song@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dan Carpenter [Fri, 28 Jan 2022 14:09:22 +0000 (17:09 +0300)]
fs/ntfs3: remove unnecessary NULL check
This code triggers a Smatch warning:
fs/ntfs3/fsntfs.c:1606 ntfs_bio_fill_1()
warn: variable dereferenced before check 'bio' (see line 1591)
The "bio" pointer cannot be NULL so there is no need to check.
Originally there was more extensive NULL checking but it was removed
because bio_alloc() will never fail if it is allowed to sleep.
Remove this check as well.
Fixes: 39146b6f66ba ("ntfs3: remove ntfs_alloc_bio")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220128140922.GA29766@kili
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jiapeng Chong [Fri, 28 Jan 2022 04:34:54 +0000 (12:34 +0800)]
block: fix boolreturn.cocci warning
Return statements in functions returning bool should use true/false
instead of 1/0.
./block/bio.c:1081:9-10: WARNING: return of 0/1 in function
'bio_add_folio' with return type bool.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220128043454.68927-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 27 Jan 2022 06:42:21 +0000 (07:42 +0100)]
MAINTAINERS: add bio.h to the block section
bio.h is part of the block layer, so list it in the MAINTAINERS file
as such.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220127064221.1314477-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 27 Jan 2022 06:41:25 +0000 (07:41 +0100)]
block: fix the kerneldoc for bio_end_io_acct
Document the actually existing parameter name.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220127064125.1314347-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 27 Jan 2022 07:05:49 +0000 (08:05 +0100)]
block: check that there is a plug in blk_flush_plug
Rename blk_flush_plug to __blk_flush_plug and add a wrapper that includes
the NULL check instead of open coding that check everywhere.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220127070549.1377856-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
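The wrapper pattern described in the commit above (a worker that assumes a
valid plug, plus a thin wrapper that does the NULL check once) can be
sketched as follows. This is a userspace illustration only: struct
plug_sketch, do_flush_plug_sketch(), and flush_plug_sketch() are
hypothetical names standing in for the plug structure, __blk_flush_plug,
and blk_flush_plug, and the flush itself is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical plug: only counts how many times it was flushed. */
struct plug_sketch {
	int flush_count;
};

/* Worker: stands in for __blk_flush_plug; requires a valid plug. */
static void do_flush_plug_sketch(struct plug_sketch *plug, bool async)
{
	assert(plug != NULL);
	(void)async;
	plug->flush_count++;
}

/* Wrapper: stands in for blk_flush_plug; callers may pass NULL. */
static inline void flush_plug_sketch(struct plug_sketch *plug, bool async)
{
	if (plug)
		do_flush_plug_sketch(plug, async);
}
```

Call sites then invoke the wrapper unconditionally instead of open coding
the NULL test before each call.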
Christoph Hellwig [Thu, 27 Jan 2022 07:05:48 +0000 (08:05 +0100)]
block: remove blk_needs_flush_plug
blk_needs_flush_plug fails to account for the cb_list, which needs
flushing as well. Remove it and just check if there is a plug instead
of poking into the internals of the plug structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220127070549.1377856-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jan 2022 09:11:07 +0000 (10:11 +0100)]
block: pass a block_device and opf to bio_reset
Pass the block_device that we plan to use this bio for and the
operation to bio_reset to optimize the assignment. A NULL block_device
can be passed, both for the passthrough case on a raw request_queue and
to temporarily avoid refactoring some nasty code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jan 2022 09:11:06 +0000 (10:11 +0100)]
block: pass a block_device and opf to bio_init
Pass the block_device that we plan to use this bio for and the
operation to bio_init to optimize the assignment. A NULL block_device
can be passed, both for the passthrough case on a raw request_queue and
to temporarily avoid refactoring some nasty code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-19-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jan 2022 09:11:05 +0000 (10:11 +0100)]
block: pass a block_device and opf to bio_alloc
Pass the block_device and operation that we plan to use this bio for to
bio_alloc to optimize the assignment. NULL/0 can be passed, both for the
passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.
Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jan 2022 09:11:04 +0000 (10:11 +0100)]
block: pass a block_device and opf to bio_alloc_kiocb
Pass the block_device and operation that we plan to use this bio for to
bio_alloc_kiocb to optimize the assignment.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jan 2022 09:11:03 +0000 (10:11 +0100)]
block: pass a block_device and opf to bio_alloc_bioset
Pass the block_device and operation that we plan to use this bio for to
bio_alloc_bioset to optimize the assignment. NULL/0 can be passed, both
for the passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.
Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chaitanya Kulkarni [Mon, 24 Jan 2022 09:11:02 +0000 (10:11 +0100)]
block: pass a block_device and opf to blk_next_bio
All callers need to set the block_device and operation, so lift that into
the common code.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220124091107.642561-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jan 2022 09:11:01 +0000 (10:11 +0100)]
block: move blk_next_bio to bio.c
Keep blk_next_bio next to the core bio infrastructure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jan 2022 09:11:00 +0000 (10:11 +0100)]
xen-blkback: bio_alloc can't fail if it is allowed to sleep
Remove handling of NULL returns from sleeping bio_alloc calls given that
those can't fail.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220124091107.642561-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jan 2022 09:10:59 +0000 (10:10 +0100)]
rnbd-srv: remove struct rnbd_dev_blk_io
Only the priv field of rnbd_dev_blk_io is used, so store the value of
that in bio->bi_private directly and remove the entire bio_set overhead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20220124091107.642561-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 24 Jan 2022 09:10:58 +0000 (10:10 +0100)]
rnbd-srv: simplify bio mapping in process_rdma
The memory mapped in process_rdma is contiguous, so there is no need
to loop over bio_add_page. Remove rnbd_bio_map_kern and just open code
the bio allocation and mapping in the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jack Wang <jinpu.wang@ionons.com>
Tested-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20220124091107.642561-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>