Pavel Begunkov [Fri, 24 Sep 2021 20:59:46 +0000 (21:59 +0100)]
io-wq: add io_wq_work_node based stack
Apart from just using lists (i.e. io_wq_work_list), we also want to have
stacks, which are a bit faster, and have some interoperability between
them. Add a stack implementation based on io_wq_work_node and some
helpers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5d3a412a5ac0d47e0f0499d70d2207d70a68925e.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 24 Sep 2021 20:59:45 +0000 (21:59 +0100)]
io_uring: remove allocation cache array
We have several of request allocation layers, remove the last one, which
is the submit->reqs array, and always use submit->free_reqs instead.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8547095c35f7a87bab14f6447ecd30a273ed7500.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 24 Sep 2021 20:59:44 +0000 (21:59 +0100)]
io_uring: use slist for completion batching
Currently we collect requests for completion batching in an array.
Replace them with a singly linked list. It's as fast as arrays but
doesn't take some much space in ctx, and will be used in future patches.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a666826f2854d17e9fb9417fb302edfeb750f425.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 24 Sep 2021 20:59:43 +0000 (21:59 +0100)]
io_uring: make io_do_iopoll return number of reqs
Don't pass nr_events pointer around but return directly, it's less
expensive than pointer increments.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f771a8153a86f16f12ff4272524e9e549c5de40b.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 24 Sep 2021 20:59:42 +0000 (21:59 +0100)]
io_uring: force_nonspin
We don't really need to pass the number of requests to complete into
io_do_iopoll(), a flag whether to enforce non-spin mode is enough.
Should be straightforward, maybe except io_iopoll_check(). We pass !min
there, because we do never enter with the number of already reaped
requests is larger than the specified @min, apart from the first
iteration, where nr_events is 0 and so the final check should be
identical.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/782b39d1d8ec584eae15bca0a1feb6f0571fe5b8.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 24 Sep 2021 20:59:41 +0000 (21:59 +0100)]
io_uring: mark having different creds unlikely
Hint the compiler that it's not as likely to have creds different from
current attached to a request. The current code generation is far from
ideal, hopefully it can help to some compilers to remove duplicated jump
tables and so.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e7815251ac4bf5a4a23d298c752f029ae19f3837.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hao Xu [Wed, 22 Sep 2021 10:15:22 +0000 (18:15 +0800)]
io_uring: return boolean value for io_alloc_async_data
boolean value is good enough for io_alloc_async_data.
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210922101522.9179-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 15 Sep 2021 11:03:38 +0000 (12:03 +0100)]
io_uring: optimise io_req_init() sqe flags checks
IOSQE_IO_DRAIN is quite marginal and we don't care too much about
IOSQE_BUFFER_SELECT. Save to ifs and hide both of them under
SQE_VALID_FLAGS check. Now we first check whether it uses a "safe"
subset, i.e. without DRAIN and BUFFER_SELECT, and only if it's not
true we test the rest of the flags.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dccfb9ab2ab0969a2d8dc59af88fa0ce44eeb1d5.1631703764.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 15 Sep 2021 11:04:20 +0000 (12:04 +0100)]
io_uring: remove ctx referencing from complete_post
Now completions are done from task context, that means that it's either
the task itself, task_work or io-wq worker. In all those cases the ctx
will be staying alive by mutexing, explicit referencing or req references
by iowq. Remove extra ctx pinning from io_req_complete_post().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/60a0e96434c16ab4fe587651448290d61ec9a113.1631703756.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hao Xu [Mon, 13 Sep 2021 13:08:54 +0000 (21:08 +0800)]
io_uring: add more uring info to fdinfo for debug
Developers may need some uring info to help themselves debug and address
issues in production. This includes sqring/cqring head/tail and the
detailed sqe/cqe info, which is very useful when an application is hung
on a ring.
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210913130854.38542-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 8 Sep 2021 15:40:53 +0000 (16:40 +0100)]
io_uring: kill extra wake_up_process in tw add
TWA_SIGNAL already wakes the thread, no need in wake_up_process() after
it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7e90cf643f633e857443e0c9e72471b221735c50.1631115443.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 8 Sep 2021 15:40:52 +0000 (16:40 +0100)]
io_uring: dedup CQE flushing non-empty checks
We don't do io_submit_flush_completions() when there is no requests
enqueued, and every single caller checks for it. Hide that check into
the function not forgetting about inlining. That will make it much
easier for changing the empty check condition in the future.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d7ff8cef5da1b38e8ea648f5aad9a315ddfc7b57.1631115443.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 8 Sep 2021 15:40:51 +0000 (16:40 +0100)]
io_uring: inline linked part of io_req_find_next
Inline part of __io_req_find_next() that returns a request but doesn't
need io_disarm_next(). It's just two places, but makes links a bit
faster.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4126d13f23d0e91b39b3558e16bd86cafa7fcef2.1631115443.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 8 Sep 2021 15:40:50 +0000 (16:40 +0100)]
io_uring: inline io_dismantle_req
io_dismantle_req() is hot, and not _too_ huge. Inline it, there are 3
call sites, which hopefully will turn into 2 in the future.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bdd2dc30716cac270c2403e99bccd6286e4ae201.1631115443.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 8 Sep 2021 15:40:49 +0000 (16:40 +0100)]
io_uring: kill off ios_left
->ios_left is only used to decide whether to plug or not, kill it to
avoid this extra accounting, just use the initial submission number.
There is no much difference in regards of enabling plugging, where this
one does it in a few more cases, but all major ones should be covered
well.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f13993bcf5b477f9a7d52881fc49f9457ea9870a.1631115443.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bixuan Cui [Sat, 11 Sep 2021 08:58:47 +0000 (16:58 +0800)]
io-wq: Remove duplicate code in io_workqueue_create()
While task_work_add() in io_workqueue_create() is true,
then duplicate code is executed:
-> clear_bit_unlock(0, &worker->create_state);
-> io_worker_release(worker);
-> atomic_dec(&acct->nr_running);
-> io_worker_ref_put(wq);
-> return false;
-> clear_bit_unlock(0, &worker->create_state); // back to io_workqueue_create()
-> io_worker_release(worker);
-> kfree(worker);
The io_worker_release() and clear_bit_unlock() are executed twice.
Fixes:
3146cba99aa2 ("io-wq: make worker creation resilient against signals")
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Link: https://lore.kernel.org/r/20210911085847.34849-1-cuibixuan@huawei.com
Reviwed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 11 Sep 2021 22:04:50 +0000 (16:04 -0600)]
io_uring: dump sqe contents if issue fails
I recently had to look at a production problem where a request ended
up getting the dreaded -EINVAL error on submit. The most used and
hence useless of error codes, as it just tells you that something
was wrong with your request, but not more than that.
Let's dump the full sqe contents if we run into an issue failure,
that'll allow easier diagnosing of a wide variety of issues.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 19 Oct 2021 02:54:39 +0000 (20:54 -0600)]
block: fix too broad elevator check in blk_mq_free_request()
We added RQF_ELV to tell whether there's an IO scheduler attached, and
RQF_ELVPRIV tells us whether there's an IO scheduler with private data
attached. Don't check RQF_ELV in blk_mq_free_request(), what we care
about here is just if we have scheduler private data attached.
This fixes a boot crash
Fixes:
2ff0682da6e0 ("block: store elevator state in request")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: syzbot+eb8104072aeab6cc1195@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 18 Oct 2021 14:45:39 +0000 (08:45 -0600)]
nvme: wire up completion batching for the IRQ path
Trivial to do now, just need our own io_comp_batch on the stack and pass
that in to the usual command completion handling.
I pondered making this dependent on how many entries we had to process,
but even for a single entry there's no discernable difference in
performance or latency. Running a sync workload over io_uring:
t/io_uring -b512 -d1 -s1 -c1 -p0 -F1 -B1 -n2 /dev/nvme1n1 /dev/nvme2n1
yields the below performance before the patch:
IOPS=254820, BW=124MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251174, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=250806, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)
and the following after:
IOPS=255972, BW=124MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251920, BW=123MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251794, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)
which definitely isn't slower, about the same if you factor in a bit of
variance. For peak performance workloads, benchmarking shows a 2%
improvement.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 12 Oct 2021 15:28:46 +0000 (09:28 -0600)]
io_uring: utilize the io batching infrastructure for more efficient polled IO
Wire up using an io_comp_batch for f_op->iopoll(). If the lower stack
supports it, we can handle high rates of polled IO more efficiently.
This raises the single core efficiency on my system from ~6.1M IOPS to
~6.6M IOPS running a random read workload at depth 128 on two gen2
Optane drives.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 8 Oct 2021 11:59:37 +0000 (05:59 -0600)]
nvme: add support for batched completion of polled IO
Take advantage of struct io_comp_batch, if passed in to the nvme poll
handler. If it's set, rather than complete each request individually
inline, store them in the io_comp_batch list. We only do so for requests
that will complete successfully, anything else will be completed inline as
before.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 8 Oct 2021 11:50:46 +0000 (05:50 -0600)]
block: add support for blk_mq_end_request_batch()
Instead of calling blk_mq_end_request() on a single request, add a helper
that takes the new struct io_comp_batch and completes any request stored
in there.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 8 Oct 2021 11:44:23 +0000 (05:44 -0600)]
sbitmap: add helper to clear a batch of tags
sbitmap currently only supports clearing tags one-by-one, add a helper
that allows the caller to pass in an array of tags to clear.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 12 Oct 2021 15:24:29 +0000 (09:24 -0600)]
block: add a struct io_comp_batch argument to fops->iopoll()
struct io_comp_batch contains a list head and a completion handler, which
will allow completions to more effciently completed batches of IO.
For now, no functional changes in this patch, we just define the
io_comp_batch structure and add the argument to the file_operations iopoll
handler.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 13 Oct 2021 13:58:52 +0000 (07:58 -0600)]
block: provide helpers for rq_list manipulation
Instead of open-coding the list additions, traversal, and removal,
provide a basic set of helpers.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 18 Oct 2021 14:53:19 +0000 (08:53 -0600)]
block: remove some blk_mq_hw_ctx debugfs entries
Just like the blk_mq_ctx counterparts, we've got a bunch of counters
in here that are only for debugfs and are of questionnable value. They
are:
- dispatched, index of how many requests were dispatched in one go
- poll_{considered,invoked,success}, which track poll sucess rates. We're
confident in the iopoll implementation at this point, don't bother
tracking these.
As a bonus, this shrinks each hardware queue from 576 bytes to 512 bytes,
dropping a whole cacheline.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 16 Oct 2021 23:27:20 +0000 (17:27 -0600)]
block: remove debugfs blk_mq_ctx dispatched/merged/completed attributes
These were added as part of early days debugging for blk-mq, and they
are not really useful anymore. Rather than spend cycles updating them,
just get rid of them.
As a bonus, this shrinks the per-cpu software queue size from 256b
to 192b. That's a whole cacheline less.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 18 Oct 2021 20:37:29 +0000 (21:37 +0100)]
block: cache rq_flags inside blk_mq_rq_ctx_init()
Add a local variable for rq_flags, it helps to compile out some of
rq_flags reloads.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 18 Oct 2021 20:37:28 +0000 (21:37 +0100)]
block: blk_mq_rq_ctx_init cache ctx/q/hctx
We should have enough of registers in blk_mq_rq_ctx_init(), store them
in local vars, so we don't keep reloading them.
note: keeping q->elevator may look unnecessary, but it's also used
inside inlined blk_mq_tags_from_data().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 18 Oct 2021 20:37:27 +0000 (21:37 +0100)]
block: skip elevator fields init for non-elv queue
Don't init rq->hash and rq->rb_node in blk_mq_rq_ctx_init() if there is
no elevator. Also, move some other initialisers that imply barriers to
the end, so the compiler is free to rearrange and optimise other the
rest of them.
note: fold in a change from Jens leaving queue_list unconditional, as
it might lead to problems otherwise.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 15 Oct 2021 15:44:38 +0000 (09:44 -0600)]
block: store elevator state in request
Add an rq private RQF_ELV flag, which tells the block layer that this
request was initialized on a queue that has an IO scheduler attached.
This allows for faster checking in the fast path, rather than having to
deference rq->q later on.
Elevator switching does full quiesce of the queue before detaching an
IO scheduler, so it's safe to cache this in the request itself.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 16 Oct 2021 02:06:18 +0000 (20:06 -0600)]
block: only mark bio as tracked if it really is tracked
We set BIO_TRACKED unconditionally when rq_qos_throttle() is called, even
though we may not even have an rq_qos handler. Only mark it as TRACKED if
it really is potentially tracked.
This saves considerable time for the case where the bio isn't tracked:
2.64% -1.65% [kernel.vmlinux] [k] bio_endio
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 15 Oct 2021 21:03:52 +0000 (15:03 -0600)]
block: improve layout of struct request
It's been a while since this was analyzed, move some members around to
better flow with the use case. Initial state up top, and queued state
after that. This improves my peak case by about 1.5%, from 7750K to
7900K IOPS.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 14 Oct 2021 15:17:01 +0000 (09:17 -0600)]
block: move update request helpers into blk-mq.c
For some reason we still have them in blk-core, with the rest of the
request completion being in blk-mq. That causes and out-of-line call
for each completion.
Move them into blk-mq.c instead, where they belong.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 14 Oct 2021 15:15:40 +0000 (09:15 -0600)]
block: remove useless caller argument to print_req_error()
We have exactly one caller of this, just get rid of adding the useless
function name to the output.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 13 Oct 2021 15:01:43 +0000 (09:01 -0600)]
block: don't bother iter advancing a fully done bio
If we're completing nbytes and nbytes is the size of the bio, don't bother
with calling into the iterator increment helpers. Just clear the bio
size and we're done.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 14 Oct 2021 14:03:30 +0000 (15:03 +0100)]
block: convert the rest of block to bdev_get_queue
Convert bdev->bd_disk->queue to bdev_get_queue(), it's uses a cached
queue pointer and so is faster.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/addf6ea988c04213697ba3684c853e4ed7642a39.1634219547.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 14 Oct 2021 14:03:29 +0000 (15:03 +0100)]
block: use bdev_get_queue() in blk-core.c
Convert bdev->bd_disk->queue to bdev_get_queue(), it's uses a cached
queue pointer and so is faster.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/efc41f880262517c8dc32f932f1b23112f21b255.1634219547.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 14 Oct 2021 14:03:28 +0000 (15:03 +0100)]
block: use bdev_get_queue() in bio.c
Convert bdev->bd_disk->queue to bdev_get_queue(), it's uses a cached
queue pointer and so is faster.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/85c36ea784d285a5075baa10049e6b59e15fb484.1634219547.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 14 Oct 2021 14:03:27 +0000 (15:03 +0100)]
block: use bdev_get_queue() in bdev.c
Convert bdev->bd_disk->queue to bdev_get_queue(), it's uses a cached
queue pointer and so is faster.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a352936ce5d9ac719645b1e29b173d931ebcdc02.1634219547.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 14 Oct 2021 14:03:26 +0000 (15:03 +0100)]
block: cache request queue in bdev
There are tons of places where we need to get a request_queue only
having bdev, which turns into bdev->bd_disk->queue. There are probably a
hundred of such places considering inline helpers, and enough of them
are in hot paths.
Cache queue pointer in struct block_device and make use of it in
bdev_get_queue().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a3bfaecdd28956f03629d0ca5c63ebc096e1c809.1634219547.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 13 Oct 2021 18:43:41 +0000 (12:43 -0600)]
block: handle fast path of bio splitting inline
The fast path is no splitting needed. Separate the handling into a
check part we can inline, and an out-of-line handling path if we do
need to split.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 14 Oct 2021 17:17:43 +0000 (11:17 -0600)]
block: use flags instead of bit fields for blkdev_dio
This generates a lot better code for me, and bumps performance from
7650K IOPS to 7750K IOPS. Looking at profiles for the run and running
perf diff, it confirms that we're now sending a lot less time there:
6.38% -2.80% [kernel.vmlinux] [k] blkdev_direct_IO
Taking it from the 2nd most cycle consumer to only the 9th most at
3.35% of the CPU time.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 13 Oct 2021 08:57:11 +0000 (09:57 +0100)]
block: cache bdev in struct file for raw bdev IO
bdev = &BDEV_I(file->f_mapping->host)->bdev
Getting struct block_device from a file requires 2 memory dereferences
as illustrated above, that takes a toll on performance, so cache it in
yet unused file->private_data. That gives a noticeable peak performance
improvement.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/8415f9fe12e544b9da89593dfbca8de2b52efe03.1634115360.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 11:12:26 +0000 (13:12 +0200)]
nvme-multipath: enable polled I/O
Set the poll queue flag to enable polling, given that the multipath
node just dispatches the bios to a lower queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 11:12:25 +0000 (13:12 +0200)]
block: don't allow writing to the poll queue attribute
The poll attribute is a historic artefact from before when we had
explicit poll queues that require driver specific configuration.
Just print a warning when writing to the attribute.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 11:12:24 +0000 (13:12 +0200)]
block: switch polling to be bio based
Replace the blk_poll interface that requires the caller to keep a queue
and cookie from the submissions with polling based on the bio.
Polling for the bio itself leads to a few advantages:
- the cookie construction can made entirely private in blk-mq.c
- the caller does not need to remember the request_queue and cookie
separately and thus sidesteps their lifetime issues
- keeping the device and the cookie inside the bio allows to trivially
support polling BIOs remapping by stacking drivers
- a lot of code to propagate the cookie back up the submission path can
be removed entirely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 12 Oct 2021 11:12:23 +0000 (13:12 +0200)]
block: define 'struct bvec_iter' as packed
'struct bvec_iter' is embedded into 'struct bio', define it as packed
so that we can get one extra 4bytes for other uses without expanding
bio.
'struct bvec_iter' is often allocated on stack, so making it packed
doesn't affect performance. Also I have run io_uring on both
nvme/null_blk, and not observe performance effect in this way.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 11:12:22 +0000 (13:12 +0200)]
block: use SLAB_TYPESAFE_BY_RCU for the bio slab
This flags ensures that the pages will not be reused for non-bio
allocations before the end of an RCU grace period. With that we can
safely use a RCU lookup for bio polling as long as we are fine with
occasionally polling the wrong device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 11:12:21 +0000 (13:12 +0200)]
block: rename REQ_HIPRI to REQ_POLLED
Unlike the RWF_HIPRI userspace ABI which is intentionally kept vague,
the bio flag is specific to the polling implementation, so rename and
document it properly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 11:12:20 +0000 (13:12 +0200)]
io_uring: don't sleep when polling for I/O
There is no point in sleeping for the expected I/O completion timeout
in the io_uring async polling model as we never poll for a specific
I/O.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 11:12:19 +0000 (13:12 +0200)]
block: replace the spin argument to blk_iopoll with a flags argument
Switch the boolean spin argument to blk_poll to passing a set of flags
instead. This will allow to control polling behavior in a more fine
grained way.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-10-hch@lst.de
[axboe: adapt to changed io_uring iopoll]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 11:12:18 +0000 (13:12 +0200)]
blk-mq: remove blk_qc_t_valid
Move the trivial check into the only caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 11:12:17 +0000 (13:12 +0200)]
blk-mq: remove blk_qc_t_to_tag and blk_qc_t_is_internal
Merge both functions into their only caller to keep the blk-mq tag to
blk_qc_t mapping as private as possible in blk-mq.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 11:12:16 +0000 (13:12 +0200)]
blk-mq: factor out a "classic" poll helper
Factor the code to do the classic full metal polling out of blk_poll into
a separate blk_mq_poll_classic helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 11:12:15 +0000 (13:12 +0200)]
blk-mq: factor out a blk_qc_to_hctx helper
Add a helper to get the hctx from a request_queue and cookie, and fold
the blk_qc_t_to_queue_num helper into it as no other callers are left.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 11:12:14 +0000 (13:12 +0200)]
io_uring: fix a layering violation in io_iopoll_req_issued
syscall-level code can't just poke into the details of the poll cookie,
which is private information of the block layer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012111226.760968-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 11:12:13 +0000 (13:12 +0200)]
iomap: don't try to poll multi-bio I/Os in __iomap_dio_rw
If an iocb is split into multiple bios we can't poll for both. So don't
bother to even try to poll in that case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 11:12:12 +0000 (13:12 +0200)]
block: don't try to poll multi-bio I/Os in __blkdev_direct_IO
If an iocb is split into multiple bios we can't poll for both. So don't
even bother to try to poll in that case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012111226.760968-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 11:12:11 +0000 (13:12 +0200)]
direct-io: remove blk_poll support
The polling support in the legacy direct-io support is a little crufty.
It already doesn't support the asynchronous polling needed for io_uring
polling, and is hard to adopt to upcoming changes in the polling
interfaces. Given that all the major file systems already use the iomap
direct I/O code, just drop the polling support.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 14 Oct 2021 13:24:07 +0000 (07:24 -0600)]
block: only check previous entry for plug merge attempt
Currently we scan the entire plug list, which is potentially very
expensive. In an IOPS bound workload, we can drive about 5.6M IOPS with
merging enabled, and profiling shows that the plug merge check is the
(by far) most expensive thing we're doing:
Overhead Command Shared Object Symbol
+ 20.89% io_uring [kernel.vmlinux] [k] blk_attempt_plug_merge
+ 4.98% io_uring [kernel.vmlinux] [k] io_submit_sqes
+ 4.78% io_uring [kernel.vmlinux] [k] blkdev_direct_IO
+ 4.61% io_uring [kernel.vmlinux] [k] blk_mq_submit_bio
Instead of browsing the whole list, just check the previously inserted
entry. That is enough for a naive merge check and will catch most cases,
and for devices that need full merging, the IO scheduler attached to
such devices will do that anyway. The plug merge is meant to be an
inexpensive check to avoid getting a request, but if we repeatedly
scan the list for every single insert, it is very much not a cheap
check.
With this patch, the workload instead runs at ~7.0M IOPS, providing
a 25% improvement. Disabling merging entirely yields another 5%
improvement.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Masahiro Yamada [Mon, 27 Sep 2021 14:00:00 +0000 (23:00 +0900)]
block: move CONFIG_BLOCK guard to top Makefile
Every object under block/ depends on CONFIG_BLOCK.
Move the guard to the top Makefile since there is no point to
descend into block/ if CONFIG_BLOCK=n.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210927140000.866249-5-masahiroy@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Masahiro Yamada [Mon, 27 Sep 2021 13:59:59 +0000 (22:59 +0900)]
block: move menu "Partition type" to block/partitions/Kconfig
Move the menu to the relevant place.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210927140000.866249-4-masahiroy@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Masahiro Yamada [Mon, 27 Sep 2021 13:59:58 +0000 (22:59 +0900)]
block: simplify Kconfig files
Everything under block/ depends on BLOCK. BLOCK_HOLDER_DEPRECATED is
selected from drivers/md/Kconfig, which is entirely dependent on BLOCK.
Extend the 'if BLOCK' ... 'endif' so it covers the whole block/Kconfig.
Also, clean up the definition of BLOCK_COMPAT and BLK_MQ_PCI because
COMPAT and PCI are boolean.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210927140000.866249-3-masahiroy@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Masahiro Yamada [Mon, 27 Sep 2021 13:59:57 +0000 (22:59 +0900)]
block: remove redundant =y from BLK_CGROUP dependency
CONFIG_BLK_CGROUP is a boolean option, that is, its value is 'y' or 'n'.
The comparison to 'y' is redundant.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210927140000.866249-2-masahiroy@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 9 Oct 2021 19:10:39 +0000 (13:10 -0600)]
block: improve batched tag allocation
Add a blk_mq_get_tags() helper, which uses the new sbitmap API for
allocating a batch of tags all at once. This both simplifies the block
code for batched allocation, and it is also more efficient than just
doing repeated calls into __sbitmap_queue_get().
This reduces the sbitmap overhead in peak runs from ~3% to ~1% and
yields a performanc increase from 6.6M IOPS to 6.8M IOPS for a single
CPU core.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 9 Oct 2021 19:02:23 +0000 (13:02 -0600)]
sbitmap: add __sbitmap_queue_get_batch()
The block layer tag allocation batching still calls into sbitmap to get
each tag, but we can improve on that. Add __sbitmap_queue_get_batch(),
which returns a mask of tags all at once, along with an offset for
those tags.
An example return would be 0xff, where bits 0..7 are set, with
tag_offset == 128. The valid tags in this case would be 128..135.
A batch is specific to an individual sbitmap_map, hence it cannot be
larger than that. The requested number of tags is automatically reduced
to the max that can be satisfied with a single map.
On failure, 0 is returned. Caller should fall back to single tag
allocation at that point/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 13 Oct 2021 08:57:13 +0000 (09:57 +0100)]
blk-mq: optimise *end_request non-stat path
We already have a blk_mq_need_time_stamp() check in
__blk_mq_end_request() to get a timestamp, hide all the statistics
accounting under it. It cuts some cycles for requests that don't need
stats, and is free otherwise.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/e0f2ea812e93a8adcd07101212e7d7e70ca304e7.1634115360.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 16:18:04 +0000 (18:18 +0200)]
block: mark bio_truncate static
bio_truncate is only used in bio.c, so mark it static.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 16:18:03 +0000 (18:18 +0200)]
block: move bio_get_{first,last}_bvec out of bio.h
bio_get_first_bvec and bio_get_last_bvec are only used in blk-merge.c,
so move them there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 16:18:02 +0000 (18:18 +0200)]
block: mark __bio_try_merge_page static
Mark __bio_try_merge_page static and move it up a bit to avoid the need
for a forward declaration.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 16:18:01 +0000 (18:18 +0200)]
block: move bio_full out of bio.h
bio_full is only used in bio.c, so move it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 16:18:00 +0000 (18:18 +0200)]
block: fold bio_cur_bytes into blk_rq_cur_bytes
Fold bio_cur_bytes into the only caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 16:17:59 +0000 (18:17 +0200)]
block: move bio_mergeable out of bio.h
bio_mergeable is only needed by I/O schedulers, so move it to
blk-mq-sched.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 16:17:58 +0000 (18:17 +0200)]
block: don't include <linux/ioprio.h> in <linux/bio.h>
bio.h doesn't need any of the definitions from ioprio.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 16:17:57 +0000 (18:17 +0200)]
block: remove BIO_BUG_ON
BIO_DEBUG is always defined, so just switch the two instances to use
BUG_ON directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sat, 9 Oct 2021 12:25:42 +0000 (13:25 +0100)]
blk-mq: inline hot part of __blk_mq_sched_restart
Extract a fast check out of __block_mq_sched_restart() and inline it for
performance reasons.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/894abaa0998e5999f2fe18f271e5efdfc2c32bd2.1633781740.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sat, 9 Oct 2021 12:25:41 +0000 (13:25 +0100)]
block: inline hot paths of blk_account_io_*()
Extract hot paths of __blk_account_io_start() and
__blk_account_io_done() into inline functions, so we don't always pay
for function calls.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b0662a636bd4cc7b4f84c9d0a41efa46a688ef13.1633781740.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 10:44:50 +0000 (12:44 +0200)]
block: merge block_ioctl into blkdev_ioctl
Simplify the ioctl path and match the code structure on the compat side.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104450.659013-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 10:44:49 +0000 (12:44 +0200)]
block: move the *blkdev_ioctl declarations out of blkdev.h
These are only used inside of block/.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104450.659013-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 10:44:48 +0000 (12:44 +0200)]
block: unexport blkdev_ioctl
With the raw driver gone, there is no modular user left.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104450.659013-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 16 Oct 2021 13:34:49 +0000 (07:34 -0600)]
block: don't dereference request after flush insertion
We could have a race here, where the request gets freed before we call
into blk_mq_run_hw_queue(). If this happens, we cannot rely on the state
of the request.
Grab the hardware context before inserting the flush.
Fixes:
0f38d7664615 ("blk-mq: cleanup blk_mq_submit_bio")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 10:40:45 +0000 (12:40 +0200)]
blk-mq: cleanup blk_mq_submit_bio
Move the blk_mq_alloc_data stack allocation only into the branch
that actually needs it, and use rq->mq_hctx instead of data.hctx
to refer to the hctx.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104045.658051-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 Oct 2021 10:40:44 +0000 (12:40 +0200)]
blk-mq: cleanup and rename __blk_mq_alloc_request
The newly added loop for the cached requests in __blk_mq_alloc_request
is a little too convoluted for my taste, so unwind it a bit. Also
rename the function to __blk_mq_alloc_requests now that it can allocate
more than a single request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104045.658051-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 6 Oct 2021 12:34:11 +0000 (06:34 -0600)]
block: pre-allocate requests if plug is started and is a batch
The caller typically has a good (or even exact) idea of how many requests
it needs to submit. We can make the request/tag allocation a lot more
efficient if we just allocate N requests/tags upfront when we queue the
first bio from the batch.
Provide a new plug start helper that allows the caller to specify how many
IOs are expected. This sets plug->nr_ios, and we can use that for smarter
request allocation. The plug provides a holding spot for requests, and
request allocation will check it before calling into the normal request
allocation path.
The blk_finish_plug() is called, check if there are unused requests and
free them. This should not happen in normal operations. The exception is
if we get merging, then we may be left with requests that need freeing
when done.
This raises the per-core performance on my setup from ~5.8M to ~6.1M
IOPS.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 6 Oct 2021 18:01:07 +0000 (12:01 -0600)]
block: bump max plugged deferred size from 16 to 32
Particularly for NVMe with efficient deferred submission for many
requests, there are nice benefits to be seen by bumping the default max
plug count from 16 to 32. This is especially true for virtualized setups,
where the submit part is more expensive. But can be noticed even on
native hardware.
Reduce the multiple queue factor from 4 to 2, since we're changing the
default size.
While changing it, move the defines into the block layer private header.
These aren't values that anyone outside of the block layer uses, or
should use.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 5 Oct 2021 15:23:59 +0000 (09:23 -0600)]
block: inherit request start time from bio for BLK_CGROUP
Doing high IOPS testing with blk-cgroups enabled spends ~15-20% of the
time just doing ktime_get_ns() -> readtsc. We essentially read and
set the start time twice, one for the bio and then again when that bio
is mapped to a request.
Given that the time between the two is very short, inherit the bio
start time instead of reading it again. This cuts 1/3rd of the overhead
of the time keeping.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 5 Oct 2021 15:11:56 +0000 (09:11 -0600)]
block: move blk-throtl fast path inline
Even if no policies are defined, we spend ~2% of the total IO time
checking. Move the fast path inline.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
John Garry [Tue, 5 Oct 2021 10:23:39 +0000 (18:23 +0800)]
blk-mq: Change shared sbitmap naming to shared tags
Now that shared sbitmap support really means shared tags, rename symbols
to match that.
Signed-off-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1633429419-228500-15-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
John Garry [Tue, 5 Oct 2021 10:23:38 +0000 (18:23 +0800)]
blk-mq: Stop using pointers for blk_mq_tags bitmap tags
Now that we use shared tags for shared sbitmap support, we don't require
the tags sbitmap pointers, so drop them.
This essentially reverts commit
222a5ae03cdd ("blk-mq: Use pointers for
blk_mq_tags bitmap tags").
Function blk_mq_init_bitmap_tags() is removed also, since it would be only
a wrappper for blk_mq_init_bitmaps().
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1633429419-228500-14-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
John Garry [Tue, 5 Oct 2021 10:23:37 +0000 (18:23 +0800)]
blk-mq: Use shared tags for shared sbitmap support
Currently we use separate sbitmap pairs and active_queues atomic_t for
shared sbitmap support.
However a full sets of static requests are used per HW queue, which is
quite wasteful, considering that the total number of requests usable at
any given time across all HW queues is limited by the shared sbitmap depth.
As such, it is considerably more memory efficient in the case of shared
sbitmap to allocate a set of static rqs per tag set or request queue, and
not per HW queue.
So replace the sbitmap pairs and active_queues atomic_t with a shared
tags per tagset and request queue, which will hold a set of shared static
rqs.
Since there is now no valid HW queue index to be passed to the blk_mq_ops
.init and .exit_request callbacks, pass an invalid index token. This
changes the semantics of the APIs, such that the callback would need to
validate the HW queue index before using it. Currently no user of shared
sbitmap actually uses the HW queue index (as would be expected).
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1633429419-228500-13-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
John Garry [Tue, 5 Oct 2021 10:23:36 +0000 (18:23 +0800)]
blk-mq: Refactor and rename blk_mq_free_map_and_{requests->rqs}()
Refactor blk_mq_free_map_and_requests() such that it can be used at many
sites at which the tag map and rqs are freed.
Also rename to blk_mq_free_map_and_rqs(), which is shorter and matches the
alloc equivalent.
Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-12-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
John Garry [Tue, 5 Oct 2021 10:23:35 +0000 (18:23 +0800)]
blk-mq: Add blk_mq_alloc_map_and_rqs()
Add a function to combine allocating tags and the associated requests,
and factor out common patterns to use this new function.
Some function only call blk_mq_alloc_map_and_rqs() now, but more
functionality will be added later.
Also make blk_mq_alloc_rq_map() and blk_mq_alloc_rqs() static since they
are only used in blk-mq.c, and finally rename some functions for
conciseness and consistency with other function names:
- __blk_mq_alloc_map_and_{request -> rqs}()
- blk_mq_alloc_{map_and_requests -> set_map_and_rqs}()
Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1633429419-228500-11-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
John Garry [Tue, 5 Oct 2021 10:23:34 +0000 (18:23 +0800)]
blk-mq: Add blk_mq_tag_update_sched_shared_sbitmap()
Put the functionality to update the sched shared sbitmap size in a common
function.
Since the same formula is always used to resize, and it can be got from
the request queue argument, so just pass the request queue pointer.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-10-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
John Garry [Tue, 5 Oct 2021 10:23:33 +0000 (18:23 +0800)]
blk-mq: Don't clear driver tags own mapping
Function blk_mq_clear_rq_mapping() is required to clear the sched tags
mappings in driver tags rqs[].
But there is no need for a driver tags to clear its own mapping, so skip
clearing the mapping in this scenario.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1633429419-228500-9-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
John Garry [Tue, 5 Oct 2021 10:23:32 +0000 (18:23 +0800)]
blk-mq: Pass driver tags to blk_mq_clear_rq_mapping()
Function blk_mq_clear_rq_mapping() will be used for shared sbitmap tags
in future, so pass a driver tags pointer instead of the tagset container
and HW queue index.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1633429419-228500-8-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
John Garry [Tue, 5 Oct 2021 10:23:31 +0000 (18:23 +0800)]
blk-mq-sched: Rename blk_mq_sched_free_{requests -> rqs}()
To be more concise and consistent in naming, rename
blk_mq_sched_free_requests() -> blk_mq_sched_free_rqs().
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1633429419-228500-7-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
John Garry [Tue, 5 Oct 2021 10:23:30 +0000 (18:23 +0800)]
blk-mq-sched: Rename blk_mq_sched_alloc_{tags -> map_and_rqs}()
Function blk_mq_sched_alloc_tags() does same as
__blk_mq_alloc_map_and_request(), so give a similar name to be consistent.
Similarly rename label err_free_tags -> err_free_map_and_rqs.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-6-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
John Garry [Tue, 5 Oct 2021 10:23:29 +0000 (18:23 +0800)]
blk-mq: Invert check in blk_mq_update_nr_requests()
It's easier to read:
if (x)
X;
else
Y;
over:
if (!x)
Y;
else
X;
No functional change intended.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-5-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
John Garry [Tue, 5 Oct 2021 10:23:28 +0000 (18:23 +0800)]
blk-mq: Relocate shared sbitmap resize in blk_mq_update_nr_requests()
For shared sbitmap, if the call to blk_mq_tag_update_depth() was
successful for any hctx when hctx->sched_tags is not set, then it would be
successful for all (due to nature in which blk_mq_tag_update_depth()
fails).
As such, there is no need to call blk_mq_tag_resize_shared_sbitmap() for
each hctx. So relocate the call until after the hctx iteration under the
!q->elevator check, which is equivalent (to !hctx->sched_tags).
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-4-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>