platform/kernel/linux-starfive.git
3 years agonvmet: use struct_size over open coded arithmetic
Len Baker [Sun, 17 Oct 2021 09:56:50 +0000 (11:56 +0200)]
nvmet: use struct_size over open coded arithmetic

As noted in the "Deprecated Interfaces, Language Features, Attributes,
and Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing. This could lead
to values wrapping around and a smaller allocation being made than the
caller was expecting. Using those allocations could lead to linear
overflows of heap memory and other misbehaviors.

In this case the size is not actually dynamic: all the operands
involved in the calculation are constant values. However, it is better to
refactor this anyway, just to keep the open-coded math idiom out of the
code.

So, use the struct_size() helper to do the arithmetic instead of the
argument "size + count * size" in the kmalloc() function.
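
As a rough illustration (the struct and field names below are made up for
this changelog, they are not the actual nvmet code), the transformation
looks like this:

struct example {
        u32 count;
        u32 entries[];  /* flexible array member */
};

/* Before: the multiplication in the allocator argument can overflow */
ex = kmalloc(sizeof(*ex) + count * sizeof(ex->entries[0]), GFP_KERNEL);

/* After: struct_size() does the same math with overflow checking and
 * saturates to SIZE_MAX, so the allocation fails instead of wrapping */
ex = kmalloc(struct_size(ex, entries, count), GFP_KERNEL);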

This code was detected with the help of Coccinelle and audited and fixed
manually.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments

Signed-off-by: Len Baker <len.baker@gmx.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: drop scan_lock and always kick requeue list when removing namespaces
Hannes Reinecke [Wed, 20 Oct 2021 05:59:10 +0000 (07:59 +0200)]
nvme: drop scan_lock and always kick requeue list when removing namespaces

When reading the partition table on the initial scan hits an I/O error,
the I/O will hang with the scan_mutex held:

[<0>] do_read_cache_page+0x49b/0x790
[<0>] read_part_sector+0x39/0xe0
[<0>] read_lba+0xf9/0x1d0
[<0>] efi_partition+0xf1/0x7f0
[<0>] bdev_disk_changed+0x1ee/0x550
[<0>] blkdev_get_whole+0x81/0x90
[<0>] blkdev_get_by_dev+0x128/0x2e0
[<0>] device_add_disk+0x377/0x3c0
[<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
[<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
[<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
[<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
[<0>] nvme_scan_work+0x168/0x310 [nvme_core]
[<0>] process_one_work+0x231/0x420

and trying to delete the controller will deadlock as it tries to grab
the scan mutex:

[<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
[<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
[<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]

As we're now properly ordering the namespace list there is no need to
hold the scan_mutex in nvme_mpath_clear_ctrl_paths() anymore.
And we always need to kick the requeue list as the path will be marked
as unusable and I/O will be requeued _without_ a current path.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme-pci: clear shadow doorbell memory on resets
Keith Busch [Thu, 14 Oct 2021 16:45:42 +0000 (09:45 -0700)]
nvme-pci: clear shadow doorbell memory on resets

The host memory doorbell and event buffers need to be initialized on
each reset so the driver doesn't observe stale values from the previous
instantiation.
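
A minimal sketch of the idea (dbbuf field names as commonly used in the
driver; mem_size is a stand-in for the buffer size, so treat this as
approximate rather than the literal code):

/* on reset, before re-enabling the shadow doorbell feature */
if (dev->dbbuf_dbs) {
        memset(dev->dbbuf_dbs, 0, mem_size);    /* shadow doorbells */
        memset(dev->dbbuf_eis, 0, mem_size);    /* event index buffers */
}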

Signed-off-by: Keith Busch <kbusch@kernel.org>
Tested-by: John Levon <john.levon@nutanix.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme-rdma: fix error code in nvme_rdma_setup_ctrl
Max Gurtovoy [Sun, 17 Oct 2021 08:58:16 +0000 (11:58 +0300)]
nvme-rdma: fix error code in nvme_rdma_setup_ctrl

In case icdoff is not zero or mandatory keyed SGLs are not supported by
the NVMe/RDMA target, we'll go to the error flow but return 0 to the
caller. Fix it by returning an appropriate error code.

Fixes: c66e2998c8ca ("nvme-rdma: centralize controller setup sequence")
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme-multipath: add error handling support for add_disk()
Luis Chamberlain [Fri, 15 Oct 2021 23:52:08 +0000 (16:52 -0700)]
nvme-multipath: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Since we can now tell for sure when a disk was added, only set the
NVME_NSHEAD_DISK_LIVE bit once the disk has actually been added
successfully.

Nothing to do here as the cleanup is done elsewhere. We take care and
use test_and_set_bit() because it protects against two nvme paths
simultaneously calling device_add_disk() on the same namespace head.
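
A hedged sketch of the resulting pattern in nvme_mpath_set_live()
(simplified, not the verbatim driver code):

if (!test_and_set_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
        rc = device_add_disk(&head->subsys->dev, head->disk,
                             nvme_ns_id_attr_groups);
        if (rc) {
                /* the disk never went live, undo the bit and bail */
                clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags);
                return;
        }
}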

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet: use macro definitions for setting cmic value
Max Gurtovoy [Thu, 23 Sep 2021 10:17:44 +0000 (13:17 +0300)]
nvmet: use macro definitions for setting cmic value

This makes the code more readable.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet: use macro definition for setting nmic value
Max Gurtovoy [Thu, 23 Sep 2021 10:17:43 +0000 (13:17 +0300)]
nvmet: use macro definition for setting nmic value

This makes the code more readable.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: display correct subsystem NQN
Hannes Reinecke [Wed, 22 Sep 2021 06:35:25 +0000 (08:35 +0200)]
nvme: display correct subsystem NQN

With discovery controllers supporting unique subsystem NQNs the actual
subsystem NQN might be different from the one passed in via the connect
args. So add a helper to display the resulting subsystem NQN.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: Add connect option 'discovery'
Hannes Reinecke [Wed, 22 Sep 2021 06:35:24 +0000 (08:35 +0200)]
nvme: Add connect option 'discovery'

Add a connect option 'discovery' to specify that the connection should
be made to a discovery controller, not a normal I/O controller. With
discovery controllers supporting unique subsystem NQNs we cannot easily
tell from the subsystem NQN alone whether this should be a discovery
connection, but we need this information to blank out options not
supported by discovery controllers.
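
As a usage illustration (the 'discovery' token is what this patch adds;
the rest of the connect string is just an example):

echo "transport=tcp,traddr=192.168.1.10,trsvcid=8009,nqn=nqn.2014-08.org.example:disc,discovery" \
        > /dev/nvme-fabrics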

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: expose subsystem type in sysfs attribute 'subsystype'
Hannes Reinecke [Wed, 22 Sep 2021 06:35:23 +0000 (08:35 +0200)]
nvme: expose subsystem type in sysfs attribute 'subsystype'

With unique discovery controller NQNs we cannot distinguish the
subsystem type by the NQN alone, but need to check the subsystem
type, too.
So expose the subsystem type in a new sysfs attribute 'subsystype'.
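
For example (the attribute name comes from this patch; the sysfs path and
the printed values shown here are illustrative):

$ cat /sys/class/nvme-subsystem/nvme-subsys0/subsystype
nvm
$ cat /sys/class/nvme-subsystem/nvme-subsys1/subsystype
discovery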

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet: set 'CNTRLTYPE' in the identify controller data
Hannes Reinecke [Wed, 22 Sep 2021 06:35:22 +0000 (08:35 +0200)]
nvmet: set 'CNTRLTYPE' in the identify controller data

Set the correct 'CNTRLTYPE' field in the identify controller data.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet: add nvmet_is_disc_subsys() helper
Hannes Reinecke [Wed, 22 Sep 2021 06:35:21 +0000 (08:35 +0200)]
nvmet: add nvmet_is_disc_subsys() helper

Add a helper function to determine if a given subsystem is a discovery
subsystem.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: add CNTRLTYPE definitions for 'identify controller'
Hannes Reinecke [Wed, 22 Sep 2021 06:35:20 +0000 (08:35 +0200)]
nvme: add CNTRLTYPE definitions for 'identify controller'

Update the 'identify controller' structure to define the newly added
CNTRLTYPE field.
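
The controller type values come straight from the NVMe specification; a
sketch of the definitions (the exact kernel enum names may differ
slightly):

enum {
        NVME_CTRL_CNTRLTYPE_IO          = 1,    /* I/O controller */
        NVME_CTRL_CNTRLTYPE_DISC        = 2,    /* discovery controller */
        NVME_CTRL_CNTRLTYPE_ADMIN       = 3,    /* administrative controller */
};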

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet: make discovery NQN configurable
Hannes Reinecke [Wed, 22 Sep 2021 06:35:19 +0000 (08:35 +0200)]
nvmet: make discovery NQN configurable

TPAR8013 allows for unique discovery NQNs, so make the discovery
controller NQN configurable by exposing a subsys attribute
'discovery_nqn'.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet-rdma: implement get_max_queue_size controller op
Max Gurtovoy [Wed, 22 Sep 2021 21:55:37 +0000 (00:55 +0300)]
nvmet-rdma: implement get_max_queue_size controller op

Limit the maximal queue size for RDMA controllers. Today, the target
reports a limit of 1024 and this limit isn't valid for some of the RDMA
based controllers. For now, limit RDMA transport to 128 entries (the
max queue depth configured for Linux NVMe/RDMA host).

A future general solution should use the RDMA/core API to calculate this
size according to device capabilities and the number of WRs needed per
NVMe IO request.

Reported-by: Mark Ruijter <mruijter@primelogic.nl>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet: add get_max_queue_size op for controllers
Max Gurtovoy [Wed, 22 Sep 2021 21:55:36 +0000 (00:55 +0300)]
nvmet: add get_max_queue_size op for controllers

Some transports, such as RDMA, would like to set the queue size
according to device/port/ctrl characteristics. Add a new nvmet transport
op that is called during ctrl initialization. This will not affect
transports that don't implement this option.
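
A hedged sketch of what such an op looks like and how controller init
could use it (field placement and names are approximate; max_qsize here
is just a stand-in for the advertised queue-size limit):

struct nvmet_fabrics_ops {
        /* ... */
        u16 (*get_max_queue_size)(const struct nvmet_ctrl *ctrl);
};

/* during controller initialization */
if (ctrl->ops->get_max_queue_size)
        max_qsize = min_t(u16, max_qsize,
                          ctrl->ops->get_max_queue_size(ctrl));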

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme-rdma: limit the maximal queue size for RDMA controllers
Max Gurtovoy [Wed, 22 Sep 2021 21:55:35 +0000 (00:55 +0300)]
nvme-rdma: limit the maximal queue size for RDMA controllers

The current limit of 1024 isn't valid for some of the RDMA based ctrls.
In case the target exposes a cap with a larger number of entries (e.g.
1024), the initiator may fail to create a QP of this size. Thus limit it
to a value that works for all RDMA adapters.

A future general solution should use the RDMA/core API to calculate this
size according to device capabilities and the number of WRs needed per
NVMe IO request.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet-tcp: fix use-after-free when a port is removed
Israel Rukshin [Wed, 6 Oct 2021 08:09:45 +0000 (08:09 +0000)]
nvmet-tcp: fix use-after-free when a port is removed

When removing a port, all its controllers are being removed, but there
are queues on the port that don't belong to any controller (during
connection time). This causes a use-after-free bug for any command that
dereferences req->port (like in nvmet_alloc_ctrl). Those queues should
be destroyed before freeing the port via configfs. Destroying the
remaining queues after accept_work has been cancelled guarantees that no
new queue will be created.

Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet-rdma: fix use-after-free when a port is removed
Israel Rukshin [Wed, 6 Oct 2021 08:09:44 +0000 (08:09 +0000)]
nvmet-rdma: fix use-after-free when a port is removed

When removing a port, all its controllers are being removed, but there
are queues on the port that don't belong to any controller (during
connection time). This causes a use-after-free bug for any command that
dereferences req->port (like in nvmet_alloc_ctrl). Those queues should
be destroyed before freeing the port via configfs. Destroying the
remaining queues after the RDMA-CM has been destroyed guarantees that no
new queue will be created.

Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet: fix use-after-free when a port is removed
Israel Rukshin [Wed, 6 Oct 2021 08:09:43 +0000 (08:09 +0000)]
nvmet: fix use-after-free when a port is removed

When a port is removed through configfs, any connected controllers start
their teardown flow asynchronously and can still send commands. This
causes a use-after-free bug for any command that dereferences req->port
(like in nvmet_parse_io_cmd).

To fix this, wait for all the teardown scheduled works to complete
(like release_work at rdma/tcp drivers). This ensures there are no
active controllers when the port is eventually removed.

Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agoqla2xxx: add ->map_queues support for nvme
Saurav Kashyap [Mon, 23 Aug 2021 12:56:49 +0000 (05:56 -0700)]
qla2xxx: add ->map_queues support for nvme

Implement ->map_queues and use the block layer blk_mq_pci_map_queues
helper for mapping queues to CPUs.

With this mapping a minimum 10%+ increase in performance is observed.
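
A hedged sketch of the general pattern (not the literal qla2xxx code; the
example_* names are hypothetical LLDD glue):

static void example_nvme_map_queues(struct nvme_fc_local_port *lport,
                                    struct blk_mq_queue_map *qmap)
{
        struct example_hw *hw = example_lport_to_hw(lport); /* hypothetical */

        /* map hardware queues to CPUs along the MSI-X vector affinity */
        blk_mq_pci_map_queues(qmap, hw->pdev, 0);
}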

Signed-off-by: Saurav Kashyap <skashyap@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme-fc: add support for ->map_queues
Saurav Kashyap [Mon, 23 Aug 2021 12:56:48 +0000 (05:56 -0700)]
nvme-fc: add support for ->map_queues

NVMe FC doesn't have support for ->map_queues, unlike the PCI, RDMA and
TCP transports.  Add a ->map_queues callout so the LLDDs can provide such
functionality.

Signed-off-by: Saurav Kashyap <skashyap@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: generate uevent once a multipath namespace is operational again
Hannes Reinecke [Wed, 6 Oct 2021 09:13:13 +0000 (11:13 +0200)]
nvme: generate uevent once a multipath namespace is operational again

When fast_io_fail_tmo is set, I/O will be aborted while recovery is
still ongoing. This causes MD to set the namespace to failed, and no
further I/O will be submitted to that namespace.

However, once the recovery succeeds and the namespace becomes
operational again the NVMe subsystem doesn't send a notification,
so MD cannot automatically reinstate operation and requires
manual interaction.

This patch will send a KOBJ_CHANGE uevent per multipathed namespace
once the underlying controller transitions to LIVE, allowing an automatic
MD reassembly with these udev rules:

/etc/udev/rules.d/65-md-auto-re-add.rules:
SUBSYSTEM!="block", GOTO="md_end"

ACTION!="change", GOTO="md_end"
ENV{ID_FS_TYPE}!="linux_raid_member", GOTO="md_end"
PROGRAM="/sbin/md_raid_auto_readd.sh $devnode"
LABEL="md_end"

/sbin/md_raid_auto_readd.sh:

MDADM=/sbin/mdadm
DEVNAME=$1

export $(${MDADM} --examine --export ${DEVNAME})

if [ -z "${MD_UUID}" ]; then
    exit 1
fi

UUID_LINK=$(readlink /dev/disk/by-id/md-uuid-${MD_UUID})
MD_DEVNAME=${UUID_LINK##*/}
export $(${MDADM} --detail --export /dev/${MD_DEVNAME})
if [ -z "${MD_METADATA}" ] ; then
    exit 1
fi
if [ $(cat /sys/block/${MD_DEVNAME}/md/degraded) != 1 ]; then
    echo "${MD_DEVNAME}: array not degraded, nothing to do"
    exit 0
fi
MD_STATE=$(cat /sys/block/${MD_DEVNAME}/md/array_state)
if [ ${MD_STATE} != "clean" ] ; then
    echo "${MD_DEVNAME}: array state ${MD_STATE}, cannot re-add"
    exit 1
fi
MD_VARNAME="MD_DEVICE_dev_${DEVNAME##*/}_ROLE"
if [ ${!MD_VARNAME} = "spare" ] ; then
    ${MDADM} --manage /dev/${MD_DEVNAME} --re-add ${DEVNAME}
fi

Changes to v2:
- Add udev rules example to description
Changes to v1:
- use disk_uevent() as suggested by hch

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: don't memset() the normal read/write command
Jens Axboe [Mon, 18 Oct 2021 12:47:18 +0000 (06:47 -0600)]
nvme: don't memset() the normal read/write command

This memset in the fast path costs a lot of cycles on my setup. Here's a
top-of-profile of doing ~6.7M IOPS:

+    5.90%  io_uring  [nvme]            [k] nvme_queue_rq
+    5.32%  io_uring  [nvme_core]       [k] nvme_setup_cmd
+    5.17%  io_uring  [kernel.vmlinux]  [k] io_submit_sqes
+    4.97%  io_uring  [kernel.vmlinux]  [k] blkdev_direct_IO

and a perf diff with this patch:

     0.92%     +4.40%  [nvme_core]       [k] nvme_setup_cmd

reducing it from 5.3% to only 0.9%. This takes it from the 2nd most
cycle consumer to something that's mostly irrelevant.
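
A hedged sketch of the idea (field list abbreviated, not the full driver
code): instead of zeroing the whole struct nvme_command up front, the
read/write setup only writes the fields it actually uses:

cmnd->rw.opcode = rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read;
cmnd->rw.flags = 0;
cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id);
cmnd->rw.slba = cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
cmnd->rw.control = 0;
cmnd->rw.dsmgmt = 0;
/* metadata / protection-information fields are filled in only when used */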

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonvme: move command clear into the various setup helpers
Jens Axboe [Mon, 18 Oct 2021 12:45:06 +0000 (06:45 -0600)]
nvme: move command clear into the various setup helpers

We don't have to worry about doing extra memsets by moving it outside
the protection of RQF_DONTPREP, as nvme doesn't do partial completions.

This is in preparation for making the read/write fast path not do a full
memset of the command.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: ataflop: fix breakage introduced at blk-mq refactoring
Michael Schmitz [Tue, 19 Oct 2021 06:13:21 +0000 (19:13 +1300)]
block: ataflop: fix breakage introduced at blk-mq refactoring

Refactoring of the Atari floppy driver when converting to blk-mq
has broken the state machine in not-so-subtle ways:

finish_fdc() must be called when operations on the floppy device
have completed. This is crucial in order to release the ST-DMA
lock, which protects against concurrent access to the ST-DMA
controller by other drivers (some DMA related, most just related
to device register access - broken beyond compare, I know).

When rewriting the driver's old do_request() function, the fact
that finish_fdc() was called only when all queued requests had
completed appears to have been overlooked. Instead, the new
request function calls finish_fdc() immediately after the last
request has been queued. finish_fdc() executes a dummy seek after
most requests, and this overwrites the state machine's interrupt
handler that was set up to wait for completion of the read/write
request just prior. To make matters worse, finish_fdc() is called
before device interrupts are re-enabled, making certain that the
read/write interrupt is missed.

Shifting the finish_fdc() call into the read/write request
completion handler ensures the driver waits for the request to
actually complete. With a queue depth of 2, we won't see long
request sequences, so calling finish_fdc() unconditionally just
adds a little overhead for the dummy seeks, and keeps the code
simple.

While we're at it, kill ataflop_commit_rqs() which does nothing
but run finish_fdc() unconditionally, again likely wiping out an
in-flight request.

Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>
Fixes: 6ec3938cff95 ("ataflop: convert to blk-mq")
CC: linux-block@vger.kernel.org
CC: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Link: https://lore.kernel.org/r/20211019061321.26425-1-schmitzmic@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonbd: fix uaf in nbd_handle_reply()
Yu Kuai [Thu, 16 Sep 2021 14:18:10 +0000 (22:18 +0800)]
nbd: fix uaf in nbd_handle_reply()

There is a problem that nbd_handle_reply() might access freed request:

1) At first, a normal io is submitted and completed with scheduler:

internal_tag = blk_mq_get_tag -> get tag from sched_tags
 blk_mq_rq_ctx_init
  sched_tags->rq[internal_tag] = sched_tag->static_rq[internal_tag]
...
blk_mq_get_driver_tag
 __blk_mq_get_driver_tag -> get tag from tags
 tags->rq[tag] = sched_tag->static_rq[internal_tag]

So, both tags->rq[tag] and sched_tags->rq[internal_tag] are pointing
to the request sched_tags->static_rq[internal_tag], even if the I/O
is finished.

2) nbd server send a reply with random tag directly:

recv_work
 nbd_handle_reply
  blk_mq_tag_to_rq(tags, tag)
   rq = tags->rq[tag]

3) if the sched_tags->static_rq is freed:

blk_mq_sched_free_requests
 blk_mq_free_rqs(q->tag_set, hctx->sched_tags, i)
  -> step 2) access rq before clearing rq mapping
  blk_mq_clear_rq_mapping(set, tags, hctx_idx);
  __free_pages() -> rq is freed here

4) Then, nbd continues to use the freed request in nbd_handle_reply

Fix the problem by getting 'q_usage_counter' before blk_mq_tag_to_rq();
this ensures the request can't be freed because 'q_usage_counter' is
not zero.
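
A hedged sketch of the fix in the receive path (simplified; error
handling and the exact call sites are omitted):

/* pin the queue so the tag set's static requests can't be freed */
if (!percpu_ref_tryget(&q->q_usage_counter))
        return;                         /* device is going away */

rq = blk_mq_tag_to_rq(tags, tag);       /* request lookup is now safe */
/* ... handle the reply ... */

percpu_ref_put(&q->q_usage_counter);    /* drop the reference when done */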

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210916141810.2325276-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonbd: partition nbd_read_stat() into nbd_read_reply() and nbd_handle_reply()
Yu Kuai [Thu, 16 Sep 2021 09:33:49 +0000 (17:33 +0800)]
nbd: partition nbd_read_stat() into nbd_read_reply() and nbd_handle_reply()

Prepare to fix uaf in nbd_read_stat(), no functional changes.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210916093350.1410403-7-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonbd: clean up return value checking of sock_xmit()
Yu Kuai [Thu, 16 Sep 2021 09:33:48 +0000 (17:33 +0800)]
nbd: clean up return value checking of sock_xmit()

Checking whether sock_xmit() returns 0 is useless because it'll never
return 0; add a comment saying so and remove such checks.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210916093350.1410403-6-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonbd: don't start request if nbd_queue_rq() failed
Yu Kuai [Thu, 16 Sep 2021 09:33:47 +0000 (17:33 +0800)]
nbd: don't start request if nbd_queue_rq() failed

commit 6a468d5990ec ("nbd: don't start req until after the dead
connection logic") move blk_mq_start_request() from nbd_queue_rq()
to nbd_handle_cmd() to skip starting request if the connection is
dead. However, request is still started in other error paths.

Currently, blk_mq_end_request() will be called immediately if
nbd_queue_rq() failed, thus start request in such situation is
useless. So remove blk_mq_start_request() from error paths in
nbd_handle_cmd().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210916093350.1410403-5-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonbd: check sock index in nbd_read_stat()
Yu Kuai [Thu, 16 Sep 2021 09:33:46 +0000 (17:33 +0800)]
nbd: check sock index in nbd_read_stat()

The sock that the client sends the request on in nbd_send_cmd() and
receives the reply on in nbd_read_stat() should be the same.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210916093350.1410403-4-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonbd: make sure request completion won't concurrent
Yu Kuai [Thu, 16 Sep 2021 09:33:45 +0000 (17:33 +0800)]
nbd: make sure request completion won't concurrent

commit cddce0116058 ("nbd: Aovid double completion of a request")
try to fix that nbd_clear_que() and recv_work() can complete a
request concurrently. However, the problem still exists:

t1                    t2                     t3

nbd_disconnect_and_put
 flush_workqueue
                      recv_work
                       blk_mq_complete_request
                        blk_mq_complete_request_remote -> this is true
                         WRITE_ONCE(rq->state, MQ_RQ_COMPLETE)
                          blk_mq_raise_softirq
                                             blk_done_softirq
                                              blk_complete_reqs
                                               nbd_complete_rq
                                                blk_mq_end_request
                                                 blk_mq_free_request
                                                  WRITE_ONCE(rq->state, MQ_RQ_IDLE)
  nbd_clear_que
   blk_mq_tagset_busy_iter
    nbd_clear_req
                                                   __blk_mq_free_request
                                                    blk_mq_put_tag
     blk_mq_complete_request -> complete again

There are three places where request can be completed in nbd:
recv_work(), nbd_clear_que() and nbd_xmit_timeout(). Since they
all hold cmd->lock before completing the request, it's easy to
avoid the problem by setting and checking a cmd flag.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210916093350.1410403-3-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonbd: don't handle response without a corresponding request message
Yu Kuai [Thu, 16 Sep 2021 09:33:44 +0000 (17:33 +0800)]
nbd: don't handle response without a corresponding request message

While handling a response message from the server, nbd_read_stat() will
try to get the request by tag, and then complete the request. However,
this is problematic if nbd hasn't sent a corresponding request
message:

t1                      t2
                        submit_bio
                         nbd_queue_rq
                          blk_mq_start_request
recv_work
 nbd_read_stat
  blk_mq_tag_to_rq
 blk_mq_complete_request
                          nbd_send_cmd

Thus add a new cmd flag 'NBD_CMD_INFLIGHT'; it will be set in
nbd_send_cmd() and checked in nbd_read_stat().
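
A hedged sketch of the flag usage (simplified; both sides hold
cmd->lock):

/* nbd_send_cmd(): mark the command as actually sent */
__set_bit(NBD_CMD_INFLIGHT, &cmd->flags);

/* nbd_read_stat(): ignore replies for commands that were never sent */
if (!test_bit(NBD_CMD_INFLIGHT, &cmd->flags))
        goto out;       /* no request message was sent for this tag */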

Note that this patch can't fix the fact that blk_mq_tag_to_rq() might
return a freed request; this will be fixed in the following
patches.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210916093350.1410403-2-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomtip32xx: Remove redundant 'flush_workqueue()' calls
Christophe JAILLET [Thu, 14 Oct 2021 18:07:50 +0000 (20:07 +0200)]
mtip32xx: Remove redundant 'flush_workqueue()' calls

'destroy_workqueue()' already drains the queue before destroying it, so
there is no need to flush it explicitly.

Remove the redundant 'flush_workqueue()' calls.

This was generated with coccinelle:

@@
expression E;
@@
-  flush_workqueue(E);
destroy_workqueue(E);

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/0fea349c808c6cfbf549b0e33701320c7860c8b7.1634234221.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomd: update superblock after changing rdev flags in state_store
Xiao Ni [Wed, 13 Oct 2021 14:59:33 +0000 (22:59 +0800)]
md: update superblock after changing rdev flags in state_store

When the in memory flag is changed, we need to persist the change in the
rdev superblock flags. This is needed for "writemostly" and "failfast".

Reviewed-by: Li Feng <fengli@smartx.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomd: remove unused argument from md_new_event
Guoqing Jiang [Mon, 4 Oct 2021 15:34:53 +0000 (23:34 +0800)]
md: remove unused argument from md_new_event

Actually, mddev is not used by md_new_event.

Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomd/raid5: call roundup_pow_of_two in raid5_run
Guoqing Jiang [Mon, 4 Oct 2021 15:34:52 +0000 (23:34 +0800)]
md/raid5: call roundup_pow_of_two in raid5_run

Let's call roundup_pow_of_two here instead of open coding it.

Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomd/raid1: use rdev in raid1_write_request directly
Guoqing Jiang [Mon, 4 Oct 2021 15:34:50 +0000 (23:34 +0800)]
md/raid1: use rdev in raid1_write_request directly

We already get rdev from conf->mirrors[i].rdev at the beginning of the
loop, so just use it.

Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomd/raid1: only allocate write behind bio for WriteMostly device
Guoqing Jiang [Mon, 4 Oct 2021 15:34:48 +0000 (23:34 +0800)]
md/raid1: only allocate write behind bio for WriteMostly device

Commit 6607cd319b6b91bff94e90f798a61c031650b514 ("raid1: ensure write
behind bio has less than BIO_MAX_VECS sectors") tried to guarantee the
size of behind bio is not bigger than BIO_MAX_VECS sectors.

Unfortunately the same calltrace could still happen since an array could
enable write-behind without a write-mostly device.

To match the manpage of mdadm (which says "write-behind is only attempted
on drives marked as write-mostly"), we need to check WriteMostly flag to
avoid such unexpected behavior.

[1]. https://bugzilla.kernel.org/show_bug.cgi?id=213181#c25
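
A hedged sketch of the gating check (simplified from the actual
raid1_write_request() logic):

struct md_rdev *rdev = conf->mirrors[i].rdev;

/* only attempt write-behind on drives marked write-mostly */
if (test_bit(WriteMostly, &rdev->flags))
        alloc_behind_master_bio(r1_bio, bio);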

Cc: stable@vger.kernel.org # v5.12+
Cc: Jens Stutte <jens@chianterastutte.eu>
Reported-by: Jens Stutte <jens@chianterastutte.eu>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomd: properly unwind when failing to add the kobject in md_alloc
Christoph Hellwig [Wed, 1 Sep 2021 11:38:33 +0000 (13:38 +0200)]
md: properly unwind when failing to add the kobject in md_alloc

Add proper error handling to delete the gendisk when failing to add
the md kobject and clean up the error unwinding in general.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomd: extend disks_mutex coverage
Christoph Hellwig [Wed, 1 Sep 2021 11:38:32 +0000 (13:38 +0200)]
md: extend disks_mutex coverage

disks_mutex is intended to serialize md_alloc.  Extend it to also cover
the kobject_uevent call and getting the sysfs dirent to help reduce
error handling complexity.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomd: add the bitmap group to the default groups for the md kobject
Christoph Hellwig [Wed, 1 Sep 2021 11:38:31 +0000 (13:38 +0200)]
md: add the bitmap group to the default groups for the md kobject

Replace the deprecated default_attrs with the default_groups mechanism,
and add the always visible bitmap group to the groups created at
kobject_add time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomd: add error handling support for add_disk()
Luis Chamberlain [Wed, 1 Sep 2021 11:38:30 +0000 (13:38 +0200)]
md: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

We just do the unwinding of what was not done before, and are
sure to unlock prior to bailing.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoswim3: add missing major.h include
Jens Axboe [Sat, 2 Oct 2021 01:23:26 +0000 (19:23 -0600)]
swim3: add missing major.h include

swim3 got this through blkdev.h previously, but blkdev.h is not including
it anymore. Include it specifically for the driver, otherwise FLOPPY_MAJOR
is undefined and breaks the compile on PPC if swim3 is configured.

Fixes: b81e0c2372e6 ("block: drop unused includes in <linux/genhd.h>")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agosx8: fix an error code in carm_init_one()
Dan Carpenter [Fri, 1 Oct 2021 12:27:22 +0000 (15:27 +0300)]
sx8: fix an error code in carm_init_one()

Return a negative error code here on this error path instead of
returning success.

Fixes: 637208e74a86 ("block/sx8: add error handling support for add_disk()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/20211001122722.GC2283@kili
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agopf: fix error codes in pf_init_unit()
Dan Carpenter [Fri, 1 Oct 2021 12:26:54 +0000 (15:26 +0300)]
pf: fix error codes in pf_init_unit()

Return a negative error code instead of success on these error paths.

Fixes: fb367e6baeb0 ("pf: cleanup initialization")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/20211001122654.GB2283@kili
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agopcd: fix error codes in pcd_init_unit()
Dan Carpenter [Fri, 1 Oct 2021 12:26:23 +0000 (15:26 +0300)]
pcd: fix error codes in pcd_init_unit()

Return -ENODEV on these error paths instead of returning success.

Fixes: af761f277b7f ("pcd: cleanup initialization")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/20211001122623.GA2283@kili
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoxtensa/platforms/iss/simdisk: add error handling support for add_disk()
Luis Chamberlain [Mon, 27 Sep 2021 22:01:01 +0000 (15:01 -0700)]
xtensa/platforms/iss/simdisk: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Link: https://lore.kernel.org/r/20210927220110.1066271-7-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock/ataflop: add error handling support for add_disk()
Luis Chamberlain [Mon, 27 Sep 2021 22:03:02 +0000 (15:03 -0700)]
block/ataflop: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20210927220302.1073499-15-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock/ataflop: provide a helper for cleanup up an atari disk
Luis Chamberlain [Mon, 27 Sep 2021 22:03:01 +0000 (15:03 -0700)]
block/ataflop: provide a helper for cleanup up an atari disk

Instead of using two separate code paths for cleaning up an atari disk,
use one. We take the more careful approach and check for *all* disk
types, as is done on exit. The init path didn't have that check, as the
alternative disk types are only probed for later; they are not
initialized by default.

Yes, there is a shared tag for all disks.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20210927220302.1073499-14-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock/ataflop: add registration bool before calling del_gendisk()
Luis Chamberlain [Mon, 27 Sep 2021 22:03:00 +0000 (15:03 -0700)]
block/ataflop: add registration bool before calling del_gendisk()

The ataflop driver assumes del_gendisk() is safe to call; this is only
true because add_disk() does not return a failure, but that will
change soon. So, before we get to adding error handling for that case,
let's make sure we keep track of which disks actually get registered,
and then use this to only call del_gendisk() for them.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20210927220302.1073499-13-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock/ataflop: use the blk_cleanup_disk() helper
Luis Chamberlain [Mon, 27 Sep 2021 22:02:59 +0000 (15:02 -0700)]
block/ataflop: use the blk_cleanup_disk() helper

Use the helper to replace two lines with one.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20210927220302.1073499-12-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoswim: add error handling support for add_disk()
Luis Chamberlain [Mon, 27 Sep 2021 22:02:58 +0000 (15:02 -0700)]
swim: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Since we have a caller to do our unwinding for the disk, and this is
already dealt with safely, we can re-use our existing error path goto
label, which already deals with the cleanup.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20210927220302.1073499-11-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoswim: add a floppy registration bool which triggers del_gendisk()
Luis Chamberlain [Mon, 27 Sep 2021 22:02:57 +0000 (15:02 -0700)]
swim: add a floppy registration bool which triggers del_gendisk()

Instead of calling del_gendisk() on exit alone, let's add a
registration bool to the floppy disk state; this way the call can be
made from the shared caller, swim_cleanup_floppy_disk().

This will be more useful in subsequent patches. Right now, this just
shuffles functionality out to a helper in a safe way.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20210927220302.1073499-10-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoswim: add helper for disk cleanup
Luis Chamberlain [Mon, 27 Sep 2021 22:02:56 +0000 (15:02 -0700)]
swim: add helper for disk cleanup

Disk cleanup can be shared between exit and bringup. Use a helper to do
the work required. The only functional change at this point is that
we're being overly paranoid on exit and now check for a null disk as
well, which should be safe.

We'll expand on this later; this change just makes subsequent changes
easier to read.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20210927220302.1073499-9-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoswim: simplify using blk_cleanup_disk() on swim_remove()
Luis Chamberlain [Mon, 27 Sep 2021 22:02:55 +0000 (15:02 -0700)]
swim: simplify using blk_cleanup_disk() on swim_remove()

We can simplify swim_remove() by using one call instead of two,
just as other drivers do. Use that pattern.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20210927220302.1073499-8-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoamiflop: add error handling support for add_disk()
Luis Chamberlain [Mon, 27 Sep 2021 22:02:54 +0000 (15:02 -0700)]
amiflop: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling. The caller for fd_alloc_disk() deals with
the rest of the cleanup like the tag.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20210927220302.1073499-7-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agofloppy: add error handling support for add_disk()
Luis Chamberlain [Mon, 27 Sep 2021 22:02:53 +0000 (15:02 -0700)]
floppy: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20210927220302.1073499-6-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agofloppy: fix calling platform_device_unregister() on invalid drives
Luis Chamberlain [Mon, 27 Sep 2021 22:02:52 +0000 (15:02 -0700)]
floppy: fix calling platform_device_unregister() on invalid drives

platform_device_unregister() should only be called when the respective
platform_device_register() has been called. However, the floppy driver
currently allows failures when registering a drive, and a bail out
could easily cause an invalid call to platform_device_unregister()
where it was not intended.

Fix this by adding a bool to keep track of whether the platform device
was registered for a drive.

This does not fix any known panic / bug. This issue was found through
code inspection while preparing the driver to use the upcoming support
for device_add_disk() error handling. From what I can tell from code
inspection, the chances of this ever happening should be insanely
small, perhaps under OOM.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20210927220302.1073499-5-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agofloppy: use blk_cleanup_disk()
Luis Chamberlain [Mon, 27 Sep 2021 22:02:51 +0000 (15:02 -0700)]
floppy: use blk_cleanup_disk()

The blk_cleanup_queue() followed by put_disk() sequence can be replaced
with blk_cleanup_disk(). No need for two separate loops.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20210927220302.1073499-4-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agofloppy: fix add_disk() assumption on exit due to new developments
Luis Chamberlain [Mon, 27 Sep 2021 22:02:50 +0000 (15:02 -0700)]
floppy: fix add_disk() assumption on exit due to new developments

After the patch titled "floppy: use blk_mq_alloc_disk and
blk_cleanup_disk" the floppy driver was modified to allocate
the blk_mq_alloc_disk() which allocates the disk with the
queue. This is further clarified later with the patch titled
"block: remove alloc_disk and alloc_disk_node". This clarifies
that:

   Most drivers should use and have been converted to use
   blk_alloc_disk and blk_mq_alloc_disk.  Only the scsi
   ULPs and dasd still allocate a disk separately from the
   request_queue so don't bother with convenience macros for
   something that should not see significant new users and
   remove these wrappers.

And then we have the patch titled, "block: hold a request_queue
reference for the lifetime of struct gendisk" which ensures
that a queue is *always* present for sure during the entire
lifetime of a disk.

In the floppy driver's case the disk always comes with the queue. So
even if the queue was cleaned up on exit, putting the disk *is* still
required, and likewise, blk_cleanup_queue() on a null queue should not
happen now as disk->queue is valid from disk allocation time on.

Automatic backport code scrapers should hopefully not cherry-pick this
patch as a stable fix candidate without full due diligence to ensure
all the work done on the block layer to make this happen is merged
first.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20210927220302.1073499-3-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock/swim3: add error handling support for add_disk()
Luis Chamberlain [Mon, 27 Sep 2021 22:02:49 +0000 (15:02 -0700)]
block/swim3: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20210927220302.1073499-2-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agorbd: add add_disk() error handling
Luis Chamberlain [Mon, 27 Sep 2021 22:02:28 +0000 (15:02 -0700)]
rbd: add add_disk() error handling

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agocdrom/gdrom: add error handling support for add_disk()
Luis Chamberlain [Mon, 27 Sep 2021 22:02:27 +0000 (15:02 -0700)]
cdrom/gdrom: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agopf: add error handling support for add_disk()
Luis Chamberlain [Mon, 27 Sep 2021 22:01:56 +0000 (15:01 -0700)]
pf: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock/sx8: add error handling support for add_disk()
Luis Chamberlain [Mon, 27 Sep 2021 22:01:55 +0000 (15:01 -0700)]
block/sx8: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

A completion is used to notify the initial probe of what is happening,
so we must defer error handling until completion. Do this by
remembering the error and using the shared cleanup function.

The tags are shared and so are already handled later for the driver.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock/rsxx: add error handling support for add_disk()
Luis Chamberlain [Mon, 27 Sep 2021 22:01:53 +0000 (15:01 -0700)]
block/rsxx: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agopktcdvd: add error handling support for add_disk()
Luis Chamberlain [Mon, 27 Sep 2021 22:01:49 +0000 (15:01 -0700)]
pktcdvd: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

The out_mem2 error label already does what we need so
re-use that.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomtip32xx: add error handling support for add_disk()
Luis Chamberlain [Mon, 27 Sep 2021 22:01:48 +0000 (15:01 -0700)]
mtip32xx: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

The read_capacity_error error label already does what we need,
so just re-use that.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agopd: add error handling support for add_disk()
Luis Chamberlain [Mon, 27 Sep 2021 22:01:10 +0000 (15:01 -0700)]
pd: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agopcd: capture errors on cdrom_register()
Luis Chamberlain [Mon, 27 Sep 2021 22:01:09 +0000 (15:01 -0700)]
pcd: capture errors on cdrom_register()

No errors were being captured when cdrom_register() fails; capture the
error and return it.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agopcd: fix ordering of unregister_cdrom()
Luis Chamberlain [Mon, 27 Sep 2021 22:01:08 +0000 (15:01 -0700)]
pcd: fix ordering of unregister_cdrom()

We first register the cdrom and then call add_disk(), so we should
likewise unregister the cdrom first and then call del_gendisk().

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agopcd: add error handling support for add_disk()
Luis Chamberlain [Mon, 27 Sep 2021 22:01:07 +0000 (15:01 -0700)]
pcd: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agopd: cleanup initialization
Christoph Hellwig [Mon, 27 Sep 2021 22:01:06 +0000 (15:01 -0700)]
pd: cleanup initialization

Refactor the pd initialization to have a dedicated helper to initialize
a single disk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agopf: cleanup initialization
Christoph Hellwig [Mon, 27 Sep 2021 22:01:05 +0000 (15:01 -0700)]
pf: cleanup initialization

Refactor the pf initialization to have a dedicated helper to initialize
a single disk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agopcd: cleanup initialization
Christoph Hellwig [Mon, 27 Sep 2021 22:01:04 +0000 (15:01 -0700)]
pcd: cleanup initialization

Refactor the pcd initialization to have a dedicated helper to initialize
a single disk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agopcd: move the identify buffer into pcd_identify
Christoph Hellwig [Mon, 27 Sep 2021 22:01:03 +0000 (15:01 -0700)]
pcd: move the identify buffer into pcd_identify

No need to pass it through a bunch of functions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agon64cart: add error handling support for add_disk()
Luis Chamberlain [Mon, 27 Sep 2021 22:01:02 +0000 (15:01 -0700)]
n64cart: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agodrbd: add error handling support for add_disk()
Luis Chamberlain [Mon, 27 Sep 2021 22:00:59 +0000 (15:00 -0700)]
drbd: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoaoe: add error handling support for add_disk()
Luis Chamberlain [Mon, 27 Sep 2021 22:00:57 +0000 (15:00 -0700)]
aoe: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonbd: add error handling support for add_disk()
Luis Chamberlain [Mon, 27 Sep 2021 21:59:58 +0000 (14:59 -0700)]
nbd: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoloop: add error handling support for add_disk()
Luis Chamberlain [Mon, 27 Sep 2021 21:59:57 +0000 (14:59 -0700)]
loop: add error handling support for add_disk()

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonull_blk: poll queue support
Jens Axboe [Sat, 17 Apr 2021 15:29:49 +0000 (09:29 -0600)]
null_blk: poll queue support

There's currently no way to experiment with polled IO with null_blk,
which seems like an oversight. This patch adds support for polled IO.
We keep a list of issued IOs on submit, and then process that list
when mq_ops->poll() is invoked.

A new parameter is added, poll_queues. It defaults to 1 like the
submit queues, meaning we'll have 1 poll queue available.
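
A quick way to try it (the module parameters are from this patch; the
t/io_uring flags refer to fio's bundled helper and are illustrative):

modprobe null_blk queue_mode=2 poll_queues=2
# polled reads through io_uring, e.g. with fio's t/io_uring:
t/io_uring -p1 -d128 -b512 /dev/nullb0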

Fixes-by: Bart Van Assche <bvanassche@acm.org>
Fixes-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/baca710d-0f2a-16e2-60bd-b105b854e0ae@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonvme: wire up completion batching for the IRQ path
Jens Axboe [Mon, 18 Oct 2021 14:45:39 +0000 (08:45 -0600)]
nvme: wire up completion batching for the IRQ path

Trivial to do now, just need our own io_comp_batch on the stack and pass
that in to the usual command completion handling.

I pondered making this dependent on how many entries we had to process,
but even for a single entry there's no discernable difference in
performance or latency. Running a sync workload over io_uring:

t/io_uring -b512 -d1 -s1 -c1 -p0 -F1 -B1 -n2 /dev/nvme1n1 /dev/nvme2n1

yields the below performance before the patch:

IOPS=254820, BW=124MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251174, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=250806, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)

and the following after:

IOPS=255972, BW=124MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251920, BW=123MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251794, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)

which definitely isn't slower, about the same if you factor in a bit of
variance. For peak performance workloads, benchmarking shows a 2%
improvement.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: utilize the io batching infrastructure for more efficient polled IO
Jens Axboe [Tue, 12 Oct 2021 15:28:46 +0000 (09:28 -0600)]
io_uring: utilize the io batching infrastructure for more efficient polled IO

Wire up using an io_comp_batch for f_op->iopoll(). If the lower stack
supports it, we can handle high rates of polled IO more efficiently.

This raises the single core efficiency on my system from ~6.1M IOPS to
~6.6M IOPS running a random read workload at depth 128 on two gen2
Optane drives.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonvme: add support for batched completion of polled IO
Jens Axboe [Fri, 8 Oct 2021 11:59:37 +0000 (05:59 -0600)]
nvme: add support for batched completion of polled IO

Take advantage of struct io_comp_batch, if passed in to the nvme poll
handler. If it's set, rather than complete each request individually
inline, store them in the io_comp_batch list. We only do so for requests
that will complete successfully, anything else will be completed inline as
before.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: add support for blk_mq_end_request_batch()
Jens Axboe [Fri, 8 Oct 2021 11:50:46 +0000 (05:50 -0600)]
block: add support for blk_mq_end_request_batch()

Instead of calling blk_mq_end_request() on a single request, add a helper
that takes the new struct io_comp_batch and completes any request stored
in there.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agosbitmap: add helper to clear a batch of tags
Jens Axboe [Fri, 8 Oct 2021 11:44:23 +0000 (05:44 -0600)]
sbitmap: add helper to clear a batch of tags

sbitmap currently only supports clearing tags one-by-one, add a helper
that allows the caller to pass in an array of tags to clear.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: add a struct io_comp_batch argument to fops->iopoll()
Jens Axboe [Tue, 12 Oct 2021 15:24:29 +0000 (09:24 -0600)]
block: add a struct io_comp_batch argument to fops->iopoll()

struct io_comp_batch contains a list head and a completion handler,
which will allow batches of IO completions to be handled more
efficiently.

For now, no functional changes in this patch, we just define the
io_comp_batch structure and add the argument to the file_operations iopoll
handler.
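
For reference, a sketch of the structure as described above (treat the
exact field names as approximate):

struct io_comp_batch {
        struct request *req_list;       /* completed requests, singly linked */
        bool need_ts;                   /* completions want I/O stat timestamps */
        void (*complete)(struct io_comp_batch *); /* batch completion handler */
};

Callers would typically declare one on the stack and pass its address
down through ->iopoll().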

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: provide helpers for rq_list manipulation
Jens Axboe [Wed, 13 Oct 2021 13:58:52 +0000 (07:58 -0600)]
block: provide helpers for rq_list manipulation

Instead of open-coding the list additions, traversal, and removal,
provide a basic set of helpers.
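
A hedged sketch of how the helpers read at call sites (signatures
approximate):

struct request *rq_list = NULL, *rq;

rq_list_add(&rq_list, rq);              /* push a request onto the list */

while ((rq = rq_list_pop(&rq_list)))    /* detach entries one at a time */
        blk_mq_end_request(rq, BLK_STS_OK);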

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove some blk_mq_hw_ctx debugfs entries
Jens Axboe [Mon, 18 Oct 2021 14:53:19 +0000 (08:53 -0600)]
block: remove some blk_mq_hw_ctx debugfs entries

Just like the blk_mq_ctx counterparts, we've got a bunch of counters
in here that are only for debugfs and are of questionable value. They
are:

- dispatched, index of how many requests were dispatched in one go

- poll_{considered,invoked,success}, which track poll success rates. We're
  confident in the iopoll implementation at this point, so don't bother
  tracking these.

As a bonus, this shrinks each hardware queue from 576 bytes to 512 bytes,
dropping a whole cacheline.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove debugfs blk_mq_ctx dispatched/merged/completed attributes
Jens Axboe [Sat, 16 Oct 2021 23:27:20 +0000 (17:27 -0600)]
block: remove debugfs blk_mq_ctx dispatched/merged/completed attributes

These were added as part of early days debugging for blk-mq, and they
are not really useful anymore. Rather than spend cycles updating them,
just get rid of them.

As a bonus, this shrinks the per-cpu software queue size from 256b
to 192b. That's a whole cacheline less.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: cache rq_flags inside blk_mq_rq_ctx_init()
Pavel Begunkov [Mon, 18 Oct 2021 20:37:29 +0000 (21:37 +0100)]
block: cache rq_flags inside blk_mq_rq_ctx_init()

Add a local variable for rq_flags, it helps to compile out some of
rq_flags reloads.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: blk_mq_rq_ctx_init cache ctx/q/hctx
Pavel Begunkov [Mon, 18 Oct 2021 20:37:28 +0000 (21:37 +0100)]
block: blk_mq_rq_ctx_init cache ctx/q/hctx

We should have enough registers in blk_mq_rq_ctx_init(); store them in
local vars, so we don't keep reloading them.

note: keeping q->elevator may look unnecessary, but it's also used
inside inlined blk_mq_tags_from_data().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: skip elevator fields init for non-elv queue
Pavel Begunkov [Mon, 18 Oct 2021 20:37:27 +0000 (21:37 +0100)]
block: skip elevator fields init for non-elv queue

Don't init rq->hash and rq->rb_node in blk_mq_rq_ctx_init() if there is
no elevator. Also, move some other initialisers that imply barriers to
the end, so the compiler is free to rearrange and optimise the rest of
them.

note: fold in a change from Jens leaving queue_list unconditional, as
it might lead to problems otherwise.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: store elevator state in request
Jens Axboe [Fri, 15 Oct 2021 15:44:38 +0000 (09:44 -0600)]
block: store elevator state in request

Add an rq private RQF_ELV flag, which tells the block layer that this
request was initialized on a queue that has an IO scheduler attached.
This allows for faster checking in the fast path, rather than having to
dereference rq->q later on.

Elevator switching does full quiesce of the queue before detaching an
IO scheduler, so it's safe to cache this in the request itself.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: only mark bio as tracked if it really is tracked
Jens Axboe [Sat, 16 Oct 2021 02:06:18 +0000 (20:06 -0600)]
block: only mark bio as tracked if it really is tracked

We set BIO_TRACKED unconditionally when rq_qos_throttle() is called, even
though we may not even have an rq_qos handler. Only mark it as TRACKED if
it really is potentially tracked.

This saves considerable time for the case where the bio isn't tracked:

     2.64%     -1.65%  [kernel.vmlinux]  [k] bio_endio

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: improve layout of struct request
Jens Axboe [Fri, 15 Oct 2021 21:03:52 +0000 (15:03 -0600)]
block: improve layout of struct request

It's been a while since this was analyzed; move some members around to
better flow with the use case. Initial state up top, and queued state
after that. This improves my peak case by about 1.5%, from 7750K to
7900K IOPS.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: move update request helpers into blk-mq.c
Jens Axboe [Thu, 14 Oct 2021 15:17:01 +0000 (09:17 -0600)]
block: move update request helpers into blk-mq.c

For some reason we still have them in blk-core, with the rest of the
request completion being in blk-mq. That causes an out-of-line call
for each completion.

Move them into blk-mq.c instead, where they belong.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove useless caller argument to print_req_error()
Jens Axboe [Thu, 14 Oct 2021 15:15:40 +0000 (09:15 -0600)]
block: remove useless caller argument to print_req_error()

We have exactly one caller of this; just get rid of adding the useless
function name to the output.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>