platform/kernel/linux-rpi.git
3 years agoio_uring: don't pass tail into io_free_batch_list
Pavel Begunkov [Fri, 24 Sep 2021 20:59:54 +0000 (21:59 +0100)]
io_uring: don't pass tail into io_free_batch_list

io_free_batch_list() iterates all requests in the passed in list,
so we don't really need to know the tail but can keep iterating until
meet NULL. Just pass the first node into it and it will be enough.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4a12c84b6d887d980e05f417ba4172d04c64acae.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: inline completion batching helpers
Pavel Begunkov [Fri, 24 Sep 2021 20:59:53 +0000 (21:59 +0100)]
io_uring: inline completion batching helpers

We now have a single function for batched put of requests, just inline
struct req_batch and all related helpers into it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/595a2917f80dd94288cd7203052c7934f5446580.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: optimise batch completion
Pavel Begunkov [Fri, 24 Sep 2021 20:59:52 +0000 (21:59 +0100)]
io_uring: optimise batch completion

First, convert rest of iopoll bits to single linked lists, and also
replace per-request list_add_tail() with splicing a part of slist.

With that, use io_free_batch_list() to put/free requests. The main
advantage of it is that it's now the only user of struct req_batch and
friends, and so they can be inlined. The main overhead there was
per-request call to not-inlined io_req_free_batch(), which is expensive
enough.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b37fc6d5954b241e025eead7ab92c6f44a42f229.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: convert iopoll_completed to store_release
Pavel Begunkov [Fri, 24 Sep 2021 20:59:51 +0000 (21:59 +0100)]
io_uring: convert iopoll_completed to store_release

Convert explicit barrier around iopoll_completed to smp_load_acquire()
and smp_store_release(). Similar on the callback side, but replaces a
single smp_rmb() with per-request smp_load_acquire(), neither imply any
extra CPU ordering for x86. Use READ_ONCE as usual where it doesn't
matter.

Use it to move filling CQEs by iopoll earlier, that will be necessary
to avoid traversing the list one extra time in the future.

Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8bd663cb15efdc72d6247c38ee810964e744a450.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: add a helper for batch free
Pavel Begunkov [Fri, 24 Sep 2021 20:59:50 +0000 (21:59 +0100)]
io_uring: add a helper for batch free

Add a helper io_free_batch_list(), which takes a single linked list and
puts/frees all requests from it in an efficient manner. Will be reused
later.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4fc8306b542c6b1dd1d08e8021ef3bdb0ad15010.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: use single linked list for iopoll
Pavel Begunkov [Fri, 24 Sep 2021 20:59:49 +0000 (21:59 +0100)]
io_uring: use single linked list for iopoll

Use single linked lists for keeping iopoll requests, takes less space,
may be faster, but mostly will be of benefit for further patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/314033676b100cd485518c3bc55e1b95a0dcd71f.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: split iopoll loop
Pavel Begunkov [Fri, 24 Sep 2021 20:59:48 +0000 (21:59 +0100)]
io_uring: split iopoll loop

The main loop of io_do_iopoll() iterates and does ->iopoll() until it
meets a first completed request, then it continues from that position
and splices requests to pass them through io_iopoll_complete().

Split the loop in two for clearness, iopolling and reaping completed
requests from the list.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a7f6fd27a94845e5dc925a47a4a9765a92e514fb.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: replace list with stack for req caches
Pavel Begunkov [Fri, 24 Sep 2021 20:59:47 +0000 (21:59 +0100)]
io_uring: replace list with stack for req caches

Replace struct list_head free_list serving for caching requests with
singly linked stack, which is faster.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1bc942b82422fb2624b8353bd93aca183a022846.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio-wq: add io_wq_work_node based stack
Pavel Begunkov [Fri, 24 Sep 2021 20:59:46 +0000 (21:59 +0100)]
io-wq: add io_wq_work_node based stack

Apart from just using lists (i.e. io_wq_work_list), we also want to have
stacks, which are a bit faster, and have some interoperability between
them. Add a stack implementation based on io_wq_work_node and some
helpers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5d3a412a5ac0d47e0f0499d70d2207d70a68925e.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: remove allocation cache array
Pavel Begunkov [Fri, 24 Sep 2021 20:59:45 +0000 (21:59 +0100)]
io_uring: remove allocation cache array

We have several of request allocation layers, remove the last one, which
is the submit->reqs array, and always use submit->free_reqs instead.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8547095c35f7a87bab14f6447ecd30a273ed7500.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: use slist for completion batching
Pavel Begunkov [Fri, 24 Sep 2021 20:59:44 +0000 (21:59 +0100)]
io_uring: use slist for completion batching

Currently we collect requests for completion batching in an array.
Replace them with a singly linked list. It's as fast as arrays but
doesn't take some much space in ctx, and will be used in future patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a666826f2854d17e9fb9417fb302edfeb750f425.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: make io_do_iopoll return number of reqs
Pavel Begunkov [Fri, 24 Sep 2021 20:59:43 +0000 (21:59 +0100)]
io_uring: make io_do_iopoll return number of reqs

Don't pass nr_events pointer around but return directly, it's less
expensive than pointer increments.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f771a8153a86f16f12ff4272524e9e549c5de40b.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: force_nonspin
Pavel Begunkov [Fri, 24 Sep 2021 20:59:42 +0000 (21:59 +0100)]
io_uring: force_nonspin

We don't really need to pass the number of requests to complete into
io_do_iopoll(), a flag whether to enforce non-spin mode is enough.

Should be straightforward, maybe except io_iopoll_check(). We pass !min
there, because we do never enter with the number of already reaped
requests is larger than the specified @min, apart from the first
iteration, where nr_events is 0 and so the final check should be
identical.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/782b39d1d8ec584eae15bca0a1feb6f0571fe5b8.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: mark having different creds unlikely
Pavel Begunkov [Fri, 24 Sep 2021 20:59:41 +0000 (21:59 +0100)]
io_uring: mark having different creds unlikely

Hint the compiler that it's not as likely to have creds different from
current attached to a request. The current code generation is far from
ideal, hopefully it can help to some compilers to remove duplicated jump
tables and so.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e7815251ac4bf5a4a23d298c752f029ae19f3837.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: return boolean value for io_alloc_async_data
Hao Xu [Wed, 22 Sep 2021 10:15:22 +0000 (18:15 +0800)]
io_uring: return boolean value for io_alloc_async_data

boolean value is good enough for io_alloc_async_data.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210922101522.9179-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: optimise io_req_init() sqe flags checks
Pavel Begunkov [Wed, 15 Sep 2021 11:03:38 +0000 (12:03 +0100)]
io_uring: optimise io_req_init() sqe flags checks

IOSQE_IO_DRAIN is quite marginal and we don't care too much about
IOSQE_BUFFER_SELECT. Save to ifs and hide both of them under
SQE_VALID_FLAGS check. Now we first check whether it uses a "safe"
subset, i.e. without DRAIN and BUFFER_SELECT, and only if it's not
true we test the rest of the flags.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dccfb9ab2ab0969a2d8dc59af88fa0ce44eeb1d5.1631703764.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: remove ctx referencing from complete_post
Pavel Begunkov [Wed, 15 Sep 2021 11:04:20 +0000 (12:04 +0100)]
io_uring: remove ctx referencing from complete_post

Now completions are done from task context, that means that it's either
the task itself, task_work or io-wq worker. In all those cases the ctx
will be staying alive by mutexing, explicit referencing or req references
by iowq. Remove extra ctx pinning from io_req_complete_post().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/60a0e96434c16ab4fe587651448290d61ec9a113.1631703756.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: add more uring info to fdinfo for debug
Hao Xu [Mon, 13 Sep 2021 13:08:54 +0000 (21:08 +0800)]
io_uring: add more uring info to fdinfo for debug

Developers may need some uring info to help themselves debug and address
issues in production. This includes sqring/cqring head/tail and the
detailed sqe/cqe info, which is very useful when an application is hung
on a ring.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210913130854.38542-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: kill extra wake_up_process in tw add
Pavel Begunkov [Wed, 8 Sep 2021 15:40:53 +0000 (16:40 +0100)]
io_uring: kill extra wake_up_process in tw add

TWA_SIGNAL already wakes the thread, no need in wake_up_process() after
it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7e90cf643f633e857443e0c9e72471b221735c50.1631115443.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: dedup CQE flushing non-empty checks
Pavel Begunkov [Wed, 8 Sep 2021 15:40:52 +0000 (16:40 +0100)]
io_uring: dedup CQE flushing non-empty checks

We don't do io_submit_flush_completions() when there is no requests
enqueued, and every single caller checks for it. Hide that check into
the function not forgetting about inlining. That will make it much
easier for changing the empty check condition in the future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d7ff8cef5da1b38e8ea648f5aad9a315ddfc7b57.1631115443.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: inline linked part of io_req_find_next
Pavel Begunkov [Wed, 8 Sep 2021 15:40:51 +0000 (16:40 +0100)]
io_uring: inline linked part of io_req_find_next

Inline part of __io_req_find_next() that returns a request but doesn't
need io_disarm_next(). It's just two places, but makes links a bit
faster.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4126d13f23d0e91b39b3558e16bd86cafa7fcef2.1631115443.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: inline io_dismantle_req
Pavel Begunkov [Wed, 8 Sep 2021 15:40:50 +0000 (16:40 +0100)]
io_uring: inline io_dismantle_req

io_dismantle_req() is hot, and not _too_ huge. Inline it, there are 3
call sites, which hopefully will turn into 2 in the future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bdd2dc30716cac270c2403e99bccd6286e4ae201.1631115443.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: kill off ios_left
Pavel Begunkov [Wed, 8 Sep 2021 15:40:49 +0000 (16:40 +0100)]
io_uring: kill off ios_left

->ios_left is only used to decide whether to plug or not, kill it to
avoid this extra accounting, just use the initial submission number.
There is no much difference in regards of enabling plugging, where this
one does it in a few more cases, but all major ones should be covered
well.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f13993bcf5b477f9a7d52881fc49f9457ea9870a.1631115443.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio-wq: Remove duplicate code in io_workqueue_create()
Bixuan Cui [Sat, 11 Sep 2021 08:58:47 +0000 (16:58 +0800)]
io-wq: Remove duplicate code in io_workqueue_create()

While task_work_add() in io_workqueue_create() is true,
then duplicate code is executed:

  -> clear_bit_unlock(0, &worker->create_state);
  -> io_worker_release(worker);
  -> atomic_dec(&acct->nr_running);
  -> io_worker_ref_put(wq);
  -> return false;

  -> clear_bit_unlock(0, &worker->create_state); // back to io_workqueue_create()
  -> io_worker_release(worker);
  -> kfree(worker);

The io_worker_release() and clear_bit_unlock() are executed twice.

Fixes: 3146cba99aa2 ("io-wq: make worker creation resilient against signals")
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Link: https://lore.kernel.org/r/20210911085847.34849-1-cuibixuan@huawei.com
Reviwed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: dump sqe contents if issue fails
Jens Axboe [Sat, 11 Sep 2021 22:04:50 +0000 (16:04 -0600)]
io_uring: dump sqe contents if issue fails

I recently had to look at a production problem where a request ended
up getting the dreaded -EINVAL error on submit. The most used and
hence useless of error codes, as it just tells you that something
was wrong with your request, but not more than that.

Let's dump the full sqe contents if we run into an issue failure,
that'll allow easier diagnosing of a wide variety of issues.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: fix too broad elevator check in blk_mq_free_request()
Jens Axboe [Tue, 19 Oct 2021 02:54:39 +0000 (20:54 -0600)]
block: fix too broad elevator check in blk_mq_free_request()

We added RQF_ELV to tell whether there's an IO scheduler attached, and
RQF_ELVPRIV tells us whether there's an IO scheduler with private data
attached. Don't check RQF_ELV in blk_mq_free_request(), what we care
about here is just if we have scheduler private data attached.

This fixes a boot crash

Fixes: 2ff0682da6e0 ("block: store elevator state in request")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: syzbot+eb8104072aeab6cc1195@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonvme: wire up completion batching for the IRQ path
Jens Axboe [Mon, 18 Oct 2021 14:45:39 +0000 (08:45 -0600)]
nvme: wire up completion batching for the IRQ path

Trivial to do now, just need our own io_comp_batch on the stack and pass
that in to the usual command completion handling.

I pondered making this dependent on how many entries we had to process,
but even for a single entry there's no discernable difference in
performance or latency. Running a sync workload over io_uring:

t/io_uring -b512 -d1 -s1 -c1 -p0 -F1 -B1 -n2 /dev/nvme1n1 /dev/nvme2n1

yields the below performance before the patch:

IOPS=254820, BW=124MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251174, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=250806, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)

and the following after:

IOPS=255972, BW=124MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251920, BW=123MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251794, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)

which definitely isn't slower, about the same if you factor in a bit of
variance. For peak performance workloads, benchmarking shows a 2%
improvement.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: utilize the io batching infrastructure for more efficient polled IO
Jens Axboe [Tue, 12 Oct 2021 15:28:46 +0000 (09:28 -0600)]
io_uring: utilize the io batching infrastructure for more efficient polled IO

Wire up using an io_comp_batch for f_op->iopoll(). If the lower stack
supports it, we can handle high rates of polled IO more efficiently.

This raises the single core efficiency on my system from ~6.1M IOPS to
~6.6M IOPS running a random read workload at depth 128 on two gen2
Optane drives.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonvme: add support for batched completion of polled IO
Jens Axboe [Fri, 8 Oct 2021 11:59:37 +0000 (05:59 -0600)]
nvme: add support for batched completion of polled IO

Take advantage of struct io_comp_batch, if passed in to the nvme poll
handler. If it's set, rather than complete each request individually
inline, store them in the io_comp_batch list. We only do so for requests
that will complete successfully, anything else will be completed inline as
before.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: add support for blk_mq_end_request_batch()
Jens Axboe [Fri, 8 Oct 2021 11:50:46 +0000 (05:50 -0600)]
block: add support for blk_mq_end_request_batch()

Instead of calling blk_mq_end_request() on a single request, add a helper
that takes the new struct io_comp_batch and completes any request stored
in there.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agosbitmap: add helper to clear a batch of tags
Jens Axboe [Fri, 8 Oct 2021 11:44:23 +0000 (05:44 -0600)]
sbitmap: add helper to clear a batch of tags

sbitmap currently only supports clearing tags one-by-one, add a helper
that allows the caller to pass in an array of tags to clear.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: add a struct io_comp_batch argument to fops->iopoll()
Jens Axboe [Tue, 12 Oct 2021 15:24:29 +0000 (09:24 -0600)]
block: add a struct io_comp_batch argument to fops->iopoll()

struct io_comp_batch contains a list head and a completion handler, which
will allow completions to more effciently completed batches of IO.

For now, no functional changes in this patch, we just define the
io_comp_batch structure and add the argument to the file_operations iopoll
handler.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: provide helpers for rq_list manipulation
Jens Axboe [Wed, 13 Oct 2021 13:58:52 +0000 (07:58 -0600)]
block: provide helpers for rq_list manipulation

Instead of open-coding the list additions, traversal, and removal,
provide a basic set of helpers.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove some blk_mq_hw_ctx debugfs entries
Jens Axboe [Mon, 18 Oct 2021 14:53:19 +0000 (08:53 -0600)]
block: remove some blk_mq_hw_ctx debugfs entries

Just like the blk_mq_ctx counterparts, we've got a bunch of counters
in here that are only for debugfs and are of questionnable value. They
are:

- dispatched, index of how many requests were dispatched in one go

- poll_{considered,invoked,success}, which track poll sucess rates. We're
  confident in the iopoll implementation at this point, don't bother
  tracking these.

As a bonus, this shrinks each hardware queue from 576 bytes to 512 bytes,
dropping a whole cacheline.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove debugfs blk_mq_ctx dispatched/merged/completed attributes
Jens Axboe [Sat, 16 Oct 2021 23:27:20 +0000 (17:27 -0600)]
block: remove debugfs blk_mq_ctx dispatched/merged/completed attributes

These were added as part of early days debugging for blk-mq, and they
are not really useful anymore. Rather than spend cycles updating them,
just get rid of them.

As a bonus, this shrinks the per-cpu software queue size from 256b
to 192b. That's a whole cacheline less.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: cache rq_flags inside blk_mq_rq_ctx_init()
Pavel Begunkov [Mon, 18 Oct 2021 20:37:29 +0000 (21:37 +0100)]
block: cache rq_flags inside blk_mq_rq_ctx_init()

Add a local variable for rq_flags, it helps to compile out some of
rq_flags reloads.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: blk_mq_rq_ctx_init cache ctx/q/hctx
Pavel Begunkov [Mon, 18 Oct 2021 20:37:28 +0000 (21:37 +0100)]
block: blk_mq_rq_ctx_init cache ctx/q/hctx

We should have enough of registers in blk_mq_rq_ctx_init(), store them
in local vars, so we don't keep reloading them.

note: keeping q->elevator may look unnecessary, but it's also used
inside inlined blk_mq_tags_from_data().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: skip elevator fields init for non-elv queue
Pavel Begunkov [Mon, 18 Oct 2021 20:37:27 +0000 (21:37 +0100)]
block: skip elevator fields init for non-elv queue

Don't init rq->hash and rq->rb_node in blk_mq_rq_ctx_init() if there is
no elevator. Also, move some other initialisers that imply barriers to
the end, so the compiler is free to rearrange and optimise other the
rest of them.

note: fold in a change from Jens leaving queue_list unconditional, as
it might lead to problems otherwise.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: store elevator state in request
Jens Axboe [Fri, 15 Oct 2021 15:44:38 +0000 (09:44 -0600)]
block: store elevator state in request

Add an rq private RQF_ELV flag, which tells the block layer that this
request was initialized on a queue that has an IO scheduler attached.
This allows for faster checking in the fast path, rather than having to
deference rq->q later on.

Elevator switching does full quiesce of the queue before detaching an
IO scheduler, so it's safe to cache this in the request itself.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: only mark bio as tracked if it really is tracked
Jens Axboe [Sat, 16 Oct 2021 02:06:18 +0000 (20:06 -0600)]
block: only mark bio as tracked if it really is tracked

We set BIO_TRACKED unconditionally when rq_qos_throttle() is called, even
though we may not even have an rq_qos handler. Only mark it as TRACKED if
it really is potentially tracked.

This saves considerable time for the case where the bio isn't tracked:

     2.64%     -1.65%  [kernel.vmlinux]  [k] bio_endio

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: improve layout of struct request
Jens Axboe [Fri, 15 Oct 2021 21:03:52 +0000 (15:03 -0600)]
block: improve layout of struct request

It's been a while since this was analyzed, move some members around to
better flow with the use case. Initial state up top, and queued state
after that. This improves my peak case by about 1.5%, from 7750K to
7900K IOPS.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: move update request helpers into blk-mq.c
Jens Axboe [Thu, 14 Oct 2021 15:17:01 +0000 (09:17 -0600)]
block: move update request helpers into blk-mq.c

For some reason we still have them in blk-core, with the rest of the
request completion being in blk-mq. That causes and out-of-line call
for each completion.

Move them into blk-mq.c instead, where they belong.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove useless caller argument to print_req_error()
Jens Axboe [Thu, 14 Oct 2021 15:15:40 +0000 (09:15 -0600)]
block: remove useless caller argument to print_req_error()

We have exactly one caller of this, just get rid of adding the useless
function name to the output.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: don't bother iter advancing a fully done bio
Jens Axboe [Wed, 13 Oct 2021 15:01:43 +0000 (09:01 -0600)]
block: don't bother iter advancing a fully done bio

If we're completing nbytes and nbytes is the size of the bio, don't bother
with calling into the iterator increment helpers. Just clear the bio
size and we're done.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: convert the rest of block to bdev_get_queue
Pavel Begunkov [Thu, 14 Oct 2021 14:03:30 +0000 (15:03 +0100)]
block: convert the rest of block to bdev_get_queue

Convert bdev->bd_disk->queue to bdev_get_queue(), it's uses a cached
queue pointer and so is faster.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/addf6ea988c04213697ba3684c853e4ed7642a39.1634219547.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: use bdev_get_queue() in blk-core.c
Pavel Begunkov [Thu, 14 Oct 2021 14:03:29 +0000 (15:03 +0100)]
block: use bdev_get_queue() in blk-core.c

Convert bdev->bd_disk->queue to bdev_get_queue(), it's uses a cached
queue pointer and so is faster.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/efc41f880262517c8dc32f932f1b23112f21b255.1634219547.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: use bdev_get_queue() in bio.c
Pavel Begunkov [Thu, 14 Oct 2021 14:03:28 +0000 (15:03 +0100)]
block: use bdev_get_queue() in bio.c

Convert bdev->bd_disk->queue to bdev_get_queue(), it's uses a cached
queue pointer and so is faster.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/85c36ea784d285a5075baa10049e6b59e15fb484.1634219547.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: use bdev_get_queue() in bdev.c
Pavel Begunkov [Thu, 14 Oct 2021 14:03:27 +0000 (15:03 +0100)]
block: use bdev_get_queue() in bdev.c

Convert bdev->bd_disk->queue to bdev_get_queue(), it's uses a cached
queue pointer and so is faster.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a352936ce5d9ac719645b1e29b173d931ebcdc02.1634219547.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: cache request queue in bdev
Pavel Begunkov [Thu, 14 Oct 2021 14:03:26 +0000 (15:03 +0100)]
block: cache request queue in bdev

There are tons of places where we need to get a request_queue only
having bdev, which turns into bdev->bd_disk->queue. There are probably a
hundred of such places considering inline helpers, and enough of them
are in hot paths.

Cache queue pointer in struct block_device and make use of it in
bdev_get_queue().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a3bfaecdd28956f03629d0ca5c63ebc096e1c809.1634219547.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: handle fast path of bio splitting inline
Jens Axboe [Wed, 13 Oct 2021 18:43:41 +0000 (12:43 -0600)]
block: handle fast path of bio splitting inline

The fast path is no splitting needed. Separate the handling into a
check part we can inline, and an out-of-line handling path if we do
need to split.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: use flags instead of bit fields for blkdev_dio
Jens Axboe [Thu, 14 Oct 2021 17:17:43 +0000 (11:17 -0600)]
block: use flags instead of bit fields for blkdev_dio

This generates a lot better code for me, and bumps performance from
7650K IOPS to 7750K IOPS. Looking at profiles for the run and running
perf diff, it confirms that we're now sending a lot less time there:

     6.38%     -2.80%  [kernel.vmlinux]  [k] blkdev_direct_IO

Taking it from the 2nd most cycle consumer to only the 9th most at
3.35% of the CPU time.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: cache bdev in struct file for raw bdev IO
Pavel Begunkov [Wed, 13 Oct 2021 08:57:11 +0000 (09:57 +0100)]
block: cache bdev in struct file for raw bdev IO

bdev = &BDEV_I(file->f_mapping->host)->bdev

Getting struct block_device from a file requires 2 memory dereferences
as illustrated above, that takes a toll on performance, so cache it in
yet unused file->private_data. That gives a noticeable peak performance
improvement.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/8415f9fe12e544b9da89593dfbca8de2b52efe03.1634115360.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonvme-multipath: enable polled I/O
Christoph Hellwig [Tue, 12 Oct 2021 11:12:26 +0000 (13:12 +0200)]
nvme-multipath: enable polled I/O

Set the poll queue flag to enable polling, given that the multipath
node just dispatches the bios to a lower queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: don't allow writing to the poll queue attribute
Christoph Hellwig [Tue, 12 Oct 2021 11:12:25 +0000 (13:12 +0200)]
block: don't allow writing to the poll queue attribute

The poll attribute is a historic artefact from before when we had
explicit poll queues that require driver specific configuration.
Just print a warning when writing to the attribute.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: switch polling to be bio based
Christoph Hellwig [Tue, 12 Oct 2021 11:12:24 +0000 (13:12 +0200)]
block: switch polling to be bio based

Replace the blk_poll interface that requires the caller to keep a queue
and cookie from the submissions with polling based on the bio.

Polling for the bio itself leads to a few advantages:

 - the cookie construction can made entirely private in blk-mq.c
 - the caller does not need to remember the request_queue and cookie
   separately and thus sidesteps their lifetime issues
 - keeping the device and the cookie inside the bio allows to trivially
   support polling BIOs remapping by stacking drivers
 - a lot of code to propagate the cookie back up the submission path can
   be removed entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: define 'struct bvec_iter' as packed
Ming Lei [Tue, 12 Oct 2021 11:12:23 +0000 (13:12 +0200)]
block: define 'struct bvec_iter' as packed

'struct bvec_iter' is embedded into 'struct bio', define it as packed
so that we can get one extra 4bytes for other uses without expanding
bio.

'struct bvec_iter' is often allocated on stack, so making it packed
doesn't affect performance. Also I have run io_uring on both
nvme/null_blk, and not observe performance effect in this way.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: use SLAB_TYPESAFE_BY_RCU for the bio slab
Christoph Hellwig [Tue, 12 Oct 2021 11:12:22 +0000 (13:12 +0200)]
block: use SLAB_TYPESAFE_BY_RCU for the bio slab

This flags ensures that the pages will not be reused for non-bio
allocations before the end of an RCU grace period.  With that we can
safely use a RCU lookup for bio polling as long as we are fine with
occasionally polling the wrong device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: rename REQ_HIPRI to REQ_POLLED
Christoph Hellwig [Tue, 12 Oct 2021 11:12:21 +0000 (13:12 +0200)]
block: rename REQ_HIPRI to REQ_POLLED

Unlike the RWF_HIPRI userspace ABI which is intentionally kept vague,
the bio flag is specific to the polling implementation, so rename and
document it properly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: don't sleep when polling for I/O
Christoph Hellwig [Tue, 12 Oct 2021 11:12:20 +0000 (13:12 +0200)]
io_uring: don't sleep when polling for I/O

There is no point in sleeping for the expected I/O completion timeout
in the io_uring async polling model as we never poll for a specific
I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: replace the spin argument to blk_iopoll with a flags argument
Christoph Hellwig [Tue, 12 Oct 2021 11:12:19 +0000 (13:12 +0200)]
block: replace the spin argument to blk_iopoll with a flags argument

Switch the boolean spin argument to blk_poll to passing a set of flags
instead.  This will allow to control polling behavior in a more fine
grained way.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-10-hch@lst.de
[axboe: adapt to changed io_uring iopoll]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-mq: remove blk_qc_t_valid
Christoph Hellwig [Tue, 12 Oct 2021 11:12:18 +0000 (13:12 +0200)]
blk-mq: remove blk_qc_t_valid

Move the trivial check into the only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-mq: remove blk_qc_t_to_tag and blk_qc_t_is_internal
Christoph Hellwig [Tue, 12 Oct 2021 11:12:17 +0000 (13:12 +0200)]
blk-mq: remove blk_qc_t_to_tag and blk_qc_t_is_internal

Merge both functions into their only caller to keep the blk-mq tag to
blk_qc_t mapping as private as possible in blk-mq.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-mq: factor out a "classic" poll helper
Christoph Hellwig [Tue, 12 Oct 2021 11:12:16 +0000 (13:12 +0200)]
blk-mq: factor out a "classic" poll helper

Factor the code to do the classic full metal polling out of blk_poll into
a separate blk_mq_poll_classic helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-mq: factor out a blk_qc_to_hctx helper
Christoph Hellwig [Tue, 12 Oct 2021 11:12:15 +0000 (13:12 +0200)]
blk-mq: factor out a blk_qc_to_hctx helper

Add a helper to get the hctx from a request_queue and cookie, and fold
the blk_qc_t_to_queue_num helper into it as no other callers are left.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: fix a layering violation in io_iopoll_req_issued
Christoph Hellwig [Tue, 12 Oct 2021 11:12:14 +0000 (13:12 +0200)]
io_uring: fix a layering violation in io_iopoll_req_issued

syscall-level code can't just poke into the details of the poll cookie,
which is private information of the block layer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012111226.760968-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoiomap: don't try to poll multi-bio I/Os in __iomap_dio_rw
Christoph Hellwig [Tue, 12 Oct 2021 11:12:13 +0000 (13:12 +0200)]
iomap: don't try to poll multi-bio I/Os in __iomap_dio_rw

If an iocb is split into multiple bios we can't poll for both.  So don't
bother to even try to poll in that case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: don't try to poll multi-bio I/Os in __blkdev_direct_IO
Christoph Hellwig [Tue, 12 Oct 2021 11:12:12 +0000 (13:12 +0200)]
block: don't try to poll multi-bio I/Os in __blkdev_direct_IO

If an iocb is split into multiple bios we can't poll for both.  So don't
even bother to try to poll in that case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012111226.760968-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agodirect-io: remove blk_poll support
Christoph Hellwig [Tue, 12 Oct 2021 11:12:11 +0000 (13:12 +0200)]
direct-io: remove blk_poll support

The polling support in the legacy direct-io support is a little crufty.
It already doesn't support the asynchronous polling needed for io_uring
polling, and is hard to adopt to upcoming changes in the polling
interfaces.  Given that all the major file systems already use the iomap
direct I/O code, just drop the polling support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: only check previous entry for plug merge attempt
Jens Axboe [Thu, 14 Oct 2021 13:24:07 +0000 (07:24 -0600)]
block: only check previous entry for plug merge attempt

Currently we scan the entire plug list, which is potentially very
expensive. In an IOPS bound workload, we can drive about 5.6M IOPS with
merging enabled, and profiling shows that the plug merge check is the
(by far) most expensive thing we're doing:

  Overhead  Command   Shared Object     Symbol
  +   20.89%  io_uring  [kernel.vmlinux]  [k] blk_attempt_plug_merge
  +    4.98%  io_uring  [kernel.vmlinux]  [k] io_submit_sqes
  +    4.78%  io_uring  [kernel.vmlinux]  [k] blkdev_direct_IO
  +    4.61%  io_uring  [kernel.vmlinux]  [k] blk_mq_submit_bio

Instead of browsing the whole list, just check the previously inserted
entry. That is enough for a naive merge check and will catch most cases,
and for devices that need full merging, the IO scheduler attached to
such devices will do that anyway. The plug merge is meant to be an
inexpensive check to avoid getting a request, but if we repeatedly
scan the list for every single insert, it is very much not a cheap
check.

With this patch, the workload instead runs at ~7.0M IOPS, providing
a 25% improvement. Disabling merging entirely yields another 5%
improvement.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: move CONFIG_BLOCK guard to top Makefile
Masahiro Yamada [Mon, 27 Sep 2021 14:00:00 +0000 (23:00 +0900)]
block: move CONFIG_BLOCK guard to top Makefile

Every object under block/ depends on CONFIG_BLOCK.

Move the guard to the top Makefile since there is no point to
descend into block/ if CONFIG_BLOCK=n.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210927140000.866249-5-masahiroy@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: move menu "Partition type" to block/partitions/Kconfig
Masahiro Yamada [Mon, 27 Sep 2021 13:59:59 +0000 (22:59 +0900)]
block: move menu "Partition type" to block/partitions/Kconfig

Move the menu to the relevant place.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210927140000.866249-4-masahiroy@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: simplify Kconfig files
Masahiro Yamada [Mon, 27 Sep 2021 13:59:58 +0000 (22:59 +0900)]
block: simplify Kconfig files

Everything under block/ depends on BLOCK. BLOCK_HOLDER_DEPRECATED is
selected from drivers/md/Kconfig, which is entirely dependent on BLOCK.

Extend the 'if BLOCK' ... 'endif' so it covers the whole block/Kconfig.

Also, clean up the definition of BLOCK_COMPAT and BLK_MQ_PCI because
COMPAT and PCI are boolean.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210927140000.866249-3-masahiroy@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove redundant =y from BLK_CGROUP dependency
Masahiro Yamada [Mon, 27 Sep 2021 13:59:57 +0000 (22:59 +0900)]
block: remove redundant =y from BLK_CGROUP dependency

CONFIG_BLK_CGROUP is a boolean option, that is, its value is 'y' or 'n'.
The comparison to 'y' is redundant.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210927140000.866249-2-masahiroy@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: improve batched tag allocation
Jens Axboe [Sat, 9 Oct 2021 19:10:39 +0000 (13:10 -0600)]
block: improve batched tag allocation

Add a blk_mq_get_tags() helper, which uses the new sbitmap API for
allocating a batch of tags all at once. This both simplifies the block
code for batched allocation, and it is also more efficient than just
doing repeated calls into __sbitmap_queue_get().

This reduces the sbitmap overhead in peak runs from ~3% to ~1% and
yields a performanc increase from 6.6M IOPS to 6.8M IOPS for a single
CPU core.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agosbitmap: add __sbitmap_queue_get_batch()
Jens Axboe [Sat, 9 Oct 2021 19:02:23 +0000 (13:02 -0600)]
sbitmap: add __sbitmap_queue_get_batch()

The block layer tag allocation batching still calls into sbitmap to get
each tag, but we can improve on that. Add __sbitmap_queue_get_batch(),
which returns a mask of tags all at once, along with an offset for
those tags.

An example return would be 0xff, where bits 0..7 are set, with
tag_offset == 128. The valid tags in this case would be 128..135.

A batch is specific to an individual sbitmap_map, hence it cannot be
larger than that. The requested number of tags is automatically reduced
to the max that can be satisfied with a single map.

On failure, 0 is returned. Caller should fall back to single tag
allocation at that point/

Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-mq: optimise *end_request non-stat path
Pavel Begunkov [Wed, 13 Oct 2021 08:57:13 +0000 (09:57 +0100)]
blk-mq: optimise *end_request non-stat path

We already have a blk_mq_need_time_stamp() check in
__blk_mq_end_request() to get a timestamp, hide all the statistics
accounting under it. It cuts some cycles for requests that don't need
stats, and is free otherwise.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/e0f2ea812e93a8adcd07101212e7d7e70ca304e7.1634115360.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: mark bio_truncate static
Christoph Hellwig [Tue, 12 Oct 2021 16:18:04 +0000 (18:18 +0200)]
block: mark bio_truncate static

bio_truncate is only used in bio.c, so mark it static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: move bio_get_{first,last}_bvec out of bio.h
Christoph Hellwig [Tue, 12 Oct 2021 16:18:03 +0000 (18:18 +0200)]
block: move bio_get_{first,last}_bvec out of bio.h

bio_get_first_bvec and bio_get_last_bvec are only used in blk-merge.c,
so move them there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: mark __bio_try_merge_page static
Christoph Hellwig [Tue, 12 Oct 2021 16:18:02 +0000 (18:18 +0200)]
block: mark __bio_try_merge_page static

Mark __bio_try_merge_page static and move it up a bit to avoid the need
for a forward declaration.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: move bio_full out of bio.h
Christoph Hellwig [Tue, 12 Oct 2021 16:18:01 +0000 (18:18 +0200)]
block: move bio_full out of bio.h

bio_full is only used in bio.c, so move it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: fold bio_cur_bytes into blk_rq_cur_bytes
Christoph Hellwig [Tue, 12 Oct 2021 16:18:00 +0000 (18:18 +0200)]
block: fold bio_cur_bytes into blk_rq_cur_bytes

Fold bio_cur_bytes into the only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: move bio_mergeable out of bio.h
Christoph Hellwig [Tue, 12 Oct 2021 16:17:59 +0000 (18:17 +0200)]
block: move bio_mergeable out of bio.h

bio_mergeable is only needed by I/O schedulers, so move it to
blk-mq-sched.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: don't include <linux/ioprio.h> in <linux/bio.h>
Christoph Hellwig [Tue, 12 Oct 2021 16:17:58 +0000 (18:17 +0200)]
block: don't include <linux/ioprio.h> in <linux/bio.h>

bio.h doesn't need any of the definitions from ioprio.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove BIO_BUG_ON
Christoph Hellwig [Tue, 12 Oct 2021 16:17:57 +0000 (18:17 +0200)]
block: remove BIO_BUG_ON

BIO_DEBUG is always defined, so just switch the two instances to use
BUG_ON directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-mq: inline hot part of __blk_mq_sched_restart
Pavel Begunkov [Sat, 9 Oct 2021 12:25:42 +0000 (13:25 +0100)]
blk-mq: inline hot part of __blk_mq_sched_restart

Extract a fast check out of __block_mq_sched_restart() and inline it for
performance reasons.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/894abaa0998e5999f2fe18f271e5efdfc2c32bd2.1633781740.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: inline hot paths of blk_account_io_*()
Pavel Begunkov [Sat, 9 Oct 2021 12:25:41 +0000 (13:25 +0100)]
block: inline hot paths of blk_account_io_*()

Extract hot paths of __blk_account_io_start() and
__blk_account_io_done() into inline functions, so we don't always pay
for function calls.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b0662a636bd4cc7b4f84c9d0a41efa46a688ef13.1633781740.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: merge block_ioctl into blkdev_ioctl
Christoph Hellwig [Tue, 12 Oct 2021 10:44:50 +0000 (12:44 +0200)]
block: merge block_ioctl into blkdev_ioctl

Simplify the ioctl path and match the code structure on the compat side.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104450.659013-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: move the *blkdev_ioctl declarations out of blkdev.h
Christoph Hellwig [Tue, 12 Oct 2021 10:44:49 +0000 (12:44 +0200)]
block: move the *blkdev_ioctl declarations out of blkdev.h

These are only used inside of block/.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104450.659013-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: unexport blkdev_ioctl
Christoph Hellwig [Tue, 12 Oct 2021 10:44:48 +0000 (12:44 +0200)]
block: unexport blkdev_ioctl

With the raw driver gone, there is no modular user left.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104450.659013-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: don't dereference request after flush insertion
Jens Axboe [Sat, 16 Oct 2021 13:34:49 +0000 (07:34 -0600)]
block: don't dereference request after flush insertion

We could have a race here, where the request gets freed before we call
into blk_mq_run_hw_queue(). If this happens, we cannot rely on the state
of the request.

Grab the hardware context before inserting the flush.

Fixes: 0f38d7664615 ("blk-mq: cleanup blk_mq_submit_bio")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-mq: cleanup blk_mq_submit_bio
Christoph Hellwig [Tue, 12 Oct 2021 10:40:45 +0000 (12:40 +0200)]
blk-mq: cleanup blk_mq_submit_bio

Move the blk_mq_alloc_data stack allocation only into the branch
that actually needs it, and use rq->mq_hctx instead of data.hctx
to refer to the hctx.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104045.658051-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-mq: cleanup and rename __blk_mq_alloc_request
Christoph Hellwig [Tue, 12 Oct 2021 10:40:44 +0000 (12:40 +0200)]
blk-mq: cleanup and rename __blk_mq_alloc_request

The newly added loop for the cached requests in __blk_mq_alloc_request
is a little too convoluted for my taste, so unwind it a bit.  Also
rename the function to __blk_mq_alloc_requests now that it can allocate
more than a single request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104045.658051-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: pre-allocate requests if plug is started and is a batch
Jens Axboe [Wed, 6 Oct 2021 12:34:11 +0000 (06:34 -0600)]
block: pre-allocate requests if plug is started and is a batch

The caller typically has a good (or even exact) idea of how many requests
it needs to submit. We can make the request/tag allocation a lot more
efficient if we just allocate N requests/tags upfront when we queue the
first bio from the batch.

Provide a new plug start helper that allows the caller to specify how many
IOs are expected. This sets plug->nr_ios, and we can use that for smarter
request allocation. The plug provides a holding spot for requests, and
request allocation will check it before calling into the normal request
allocation path.

The blk_finish_plug() is called, check if there are unused requests and
free them. This should not happen in normal operations. The exception is
if we get merging, then we may be left with requests that need freeing
when done.

This raises the per-core performance on my setup from ~5.8M to ~6.1M
IOPS.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: bump max plugged deferred size from 16 to 32
Jens Axboe [Wed, 6 Oct 2021 18:01:07 +0000 (12:01 -0600)]
block: bump max plugged deferred size from 16 to 32

Particularly for NVMe with efficient deferred submission for many
requests, there are nice benefits to be seen by bumping the default max
plug count from 16 to 32. This is especially true for virtualized setups,
where the submit part is more expensive. But can be noticed even on
native hardware.

Reduce the multiple queue factor from 4 to 2, since we're changing the
default size.

While changing it, move the defines into the block layer private header.
These aren't values that anyone outside of the block layer uses, or
should use.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: inherit request start time from bio for BLK_CGROUP
Jens Axboe [Tue, 5 Oct 2021 15:23:59 +0000 (09:23 -0600)]
block: inherit request start time from bio for BLK_CGROUP

Doing high IOPS testing with blk-cgroups enabled spends ~15-20% of the
time just doing ktime_get_ns() -> readtsc. We essentially read and
set the start time twice, one for the bio and then again when that bio
is mapped to a request.

Given that the time between the two is very short, inherit the bio
start time instead of reading it again. This cuts 1/3rd of the overhead
of the time keeping.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: move blk-throtl fast path inline
Jens Axboe [Tue, 5 Oct 2021 15:11:56 +0000 (09:11 -0600)]
block: move blk-throtl fast path inline

Even if no policies are defined, we spend ~2% of the total IO time
checking. Move the fast path inline.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-mq: Change shared sbitmap naming to shared tags
John Garry [Tue, 5 Oct 2021 10:23:39 +0000 (18:23 +0800)]
blk-mq: Change shared sbitmap naming to shared tags

Now that shared sbitmap support really means shared tags, rename symbols
to match that.

Signed-off-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1633429419-228500-15-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-mq: Stop using pointers for blk_mq_tags bitmap tags
John Garry [Tue, 5 Oct 2021 10:23:38 +0000 (18:23 +0800)]
blk-mq: Stop using pointers for blk_mq_tags bitmap tags

Now that we use shared tags for shared sbitmap support, we don't require
the tags sbitmap pointers, so drop them.

This essentially reverts commit 222a5ae03cdd ("blk-mq: Use pointers for
blk_mq_tags bitmap tags").

Function blk_mq_init_bitmap_tags() is removed also, since it would be only
a wrappper for blk_mq_init_bitmaps().

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1633429419-228500-14-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-mq: Use shared tags for shared sbitmap support
John Garry [Tue, 5 Oct 2021 10:23:37 +0000 (18:23 +0800)]
blk-mq: Use shared tags for shared sbitmap support

Currently we use separate sbitmap pairs and active_queues atomic_t for
shared sbitmap support.

However a full sets of static requests are used per HW queue, which is
quite wasteful, considering that the total number of requests usable at
any given time across all HW queues is limited by the shared sbitmap depth.

As such, it is considerably more memory efficient in the case of shared
sbitmap to allocate a set of static rqs per tag set or request queue, and
not per HW queue.

So replace the sbitmap pairs and active_queues atomic_t with a shared
tags per tagset and request queue, which will hold a set of shared static
rqs.

Since there is now no valid HW queue index to be passed to the blk_mq_ops
.init and .exit_request callbacks, pass an invalid index token. This
changes the semantics of the APIs, such that the callback would need to
validate the HW queue index before using it. Currently no user of shared
sbitmap actually uses the HW queue index (as would be expected).

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1633429419-228500-13-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-mq: Refactor and rename blk_mq_free_map_and_{requests->rqs}()
John Garry [Tue, 5 Oct 2021 10:23:36 +0000 (18:23 +0800)]
blk-mq: Refactor and rename blk_mq_free_map_and_{requests->rqs}()

Refactor blk_mq_free_map_and_requests() such that it can be used at many
sites at which the tag map and rqs are freed.

Also rename to blk_mq_free_map_and_rqs(), which is shorter and matches the
alloc equivalent.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-12-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>