Pavel Begunkov [Fri, 12 Feb 2021 11:55:17 +0000 (11:55 +0000)]
io_uring: don't split out consume out of SQE get
Remove io_consume_sqe() and inline it back into io_get_sqe(). It
requires req dealloc on error, but in exchange we get cleaner
io_submit_sqes() and better locality for cached_sq_head.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 12 Feb 2021 03:23:54 +0000 (03:23 +0000)]
io_uring: save ctx put/get for task_work submit
Do a little trick in io_ring_ctx_free() briefly taking uring_lock, that
will wait for everyone currently holding it, so we can skip pinning ctx
with ctx->refs for __io_req_task_submit(), which is executed and loses
its refs/reqs while holding the lock.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 12 Feb 2021 03:23:53 +0000 (03:23 +0000)]
io_uring: don't duplicate io_req_task_queue()
Don't hand code io_req_task_queue() inside of io_async_buf_func(), just
call it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 12 Feb 2021 03:23:52 +0000 (03:23 +0000)]
io_uring: optimise SQPOLL mm/files grabbing
There are two reasons for this. First is to optimise
io_sq_thread_acquire_mm_files() for non-SQPOLL case, which currently do
too many checks and function calls in the hot path, e.g. in
io_init_req().
The second is to not grab mm/files when there are not needed. As
__io_queue_sqe() issues only one request now, we can reuse
io_sq_thread_acquire_mm_files() instead of unconditional acquire
mm/files.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 12 Feb 2021 03:23:51 +0000 (03:23 +0000)]
io_uring: optimise out unlikely link queue
__io_queue_sqe() tries to issue as much requests of a link as it can,
and uses io_put_req_find_next() to extract a next one, targeting inline
completed requests. As now __io_queue_sqe() is always used together with
struct io_comp_state, it leaves next propagation only a small window and
only for async reqs, that doesn't justify its existence.
Remove it, make __io_queue_sqe() to issue only a head request. It
simplifies the code and will allow other optimisations.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 12 Feb 2021 03:23:50 +0000 (03:23 +0000)]
io_uring: take compl state from submit state
Completion and submission states are now coupled together, it's weird to
get one from argument and another from ctx, do it consistently for
io_req_free_batch(). It's also faster as we already have @state cached
in registers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 11 Feb 2021 18:28:23 +0000 (18:28 +0000)]
io_uring: inline io_complete_rw_common()
__io_complete_rw() casts request to kiocb for it to be immediately
container_of()'ed by io_complete_rw_common(). And the last function's name
doesn't do a great job of illuminating its purposes, so just inline it in
its only user.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 11 Feb 2021 18:28:22 +0000 (18:28 +0000)]
io_uring: move res check out of io_rw_reissue()
We pass return code into io_rw_reissue() only to be able to check if it's
-EAGAIN. That's not the cleanest approach and may prevent inlining of the
non-EAGAIN fast path, so do it at call sites.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 11 Feb 2021 18:28:21 +0000 (18:28 +0000)]
io_uring: simplify iopoll reissuing
Don't stash -EAGAIN'ed iopoll requests into a list to reissue it later,
do it eagerly. It removes overhead on keeping and checking that list,
and allows in case of failure for these requests to be completed through
normal iopoll completion path.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 11 Feb 2021 18:28:20 +0000 (18:28 +0000)]
io_uring: clean up io_req_free_batch_finish()
io_req_free_batch_finish() is final and does not permit struct req_batch
to be reused without re-init. To be more consistent don't clear ->task
there.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 11 Feb 2021 17:48:03 +0000 (10:48 -0700)]
io_uring: move submit side state closer in the ring
We recently added the submit side req cache, but it was placed at the
end of the struct. Move it near the other submission state for better
memory placement, and reshuffle a few other members at the same time.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 11 Feb 2021 14:45:08 +0000 (07:45 -0700)]
io_uring: assign file_slot prior to calling io_sqe_file_register()
We use the assigned slot in io_sqe_file_register(), and a previous
patch moved the assignment to after we have called it. This isn't
super pretty, and will get cleaned up in the future. For now, fix
the regression by restoring the previous assignment/clear of the
file_slot.
Fixes:
ea64ec02b31d ("io_uring: deduplicate file table slot calculation")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Colin Ian King [Wed, 10 Feb 2021 20:00:07 +0000 (20:00 +0000)]
io_uring: remove redundant initialization of variable ret
The variable ret is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.
Addresses-Coverity: ("Unused value")
Fixes:
b63534c41e20 ("io_uring: re-issue block requests that failed because of resources")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 10 Feb 2021 11:45:42 +0000 (11:45 +0000)]
io_uring: unpark SQPOLL thread for cancelation
We park SQPOLL task before going into io_uring_cancel_files(), so the
task won't run task_works including those that might be important for
the cancellation passes. In this case it's io_poll_remove_one(), which
frees requests via io_put_req_deferred().
Unpark it for while waiting, it's ok as we disable submissions
beforehand, so no new requests will be generated.
INFO: task syz-executor893:8493 blocked for more than 143 seconds.
Call Trace:
context_switch kernel/sched/core.c:4327 [inline]
__schedule+0x90c/0x21a0 kernel/sched/core.c:5078
schedule+0xcf/0x270 kernel/sched/core.c:5157
io_uring_cancel_files fs/io_uring.c:8912 [inline]
io_uring_cancel_task_requests+0xe70/0x11a0 fs/io_uring.c:8979
__io_uring_files_cancel+0x110/0x1b0 fs/io_uring.c:9067
io_uring_files_cancel include/linux/io_uring.h:51 [inline]
do_exit+0x2fe/0x2ae0 kernel/exit.c:780
do_group_exit+0x125/0x310 kernel/exit.c:922
__do_sys_exit_group kernel/exit.c:933 [inline]
__se_sys_exit_group kernel/exit.c:931 [inline]
__x64_sys_exit_group+0x3a/0x50 kernel/exit.c:931
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Cc: stable@vger.kernel.org # 5.5+
Reported-by: syzbot+695b03d82fa8e4901b06@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 10 Feb 2021 03:14:12 +0000 (20:14 -0700)]
io_uring: place ring SQ/CQ arrays under memcg memory limits
Instead of imposing rlimit memlock limits for the rings themselves,
ensure that we account them properly under memcg with __GFP_ACCOUNT.
We retain rlimit memlock for registered buffers, this is just for the
ring arrays themselves.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 9 Feb 2021 20:48:50 +0000 (13:48 -0700)]
io_uring: enable kmemcg account for io_uring requests
This puts io_uring under the memory cgroups accounting and limits for
requests.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 10 Feb 2021 02:53:37 +0000 (19:53 -0700)]
io_uring: enable req cache for IRQ driven IO
This is the last class of requests that cannot utilize the req alloc
cache. Add a per-ctx req cache that is protected by the completion_lock,
and refill our submit side cache when it gets over our batch count.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hao Xu [Fri, 5 Feb 2021 08:34:21 +0000 (16:34 +0800)]
io_uring: fix possible deadlock in io_uring_poll
Abaci reported follow issue:
[ 30.615891] ======================================================
[ 30.616648] WARNING: possible circular locking dependency detected
[ 30.617423] 5.11.0-rc3-next-
20210115 #1 Not tainted
[ 30.618035] ------------------------------------------------------
[ 30.618914] a.out/1128 is trying to acquire lock:
[ 30.619520]
ffff88810b063868 (&ep->mtx){+.+.}-{3:3}, at: __ep_eventpoll_poll+0x9f/0x220
[ 30.620505]
[ 30.620505] but task is already holding lock:
[ 30.621218]
ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
[ 30.622349]
[ 30.622349] which lock already depends on the new lock.
[ 30.622349]
[ 30.623289]
[ 30.623289] the existing dependency chain (in reverse order) is:
[ 30.624243]
[ 30.624243] -> #1 (&ctx->uring_lock){+.+.}-{3:3}:
[ 30.625263] lock_acquire+0x2c7/0x390
[ 30.625868] __mutex_lock+0xae/0x9f0
[ 30.626451] io_cqring_overflow_flush.part.95+0x6d/0x70
[ 30.627278] io_uring_poll+0xcb/0xd0
[ 30.627890] ep_item_poll.isra.14+0x4e/0x90
[ 30.628531] do_epoll_ctl+0xb7e/0x1120
[ 30.629122] __x64_sys_epoll_ctl+0x70/0xb0
[ 30.629770] do_syscall_64+0x2d/0x40
[ 30.630332] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 30.631187]
[ 30.631187] -> #0 (&ep->mtx){+.+.}-{3:3}:
[ 30.631985] check_prevs_add+0x226/0xb00
[ 30.632584] __lock_acquire+0x1237/0x13a0
[ 30.633207] lock_acquire+0x2c7/0x390
[ 30.633740] __mutex_lock+0xae/0x9f0
[ 30.634258] __ep_eventpoll_poll+0x9f/0x220
[ 30.634879] __io_arm_poll_handler+0xbf/0x220
[ 30.635462] io_issue_sqe+0xa6b/0x13e0
[ 30.635982] __io_queue_sqe+0x10b/0x550
[ 30.636648] io_queue_sqe+0x235/0x470
[ 30.637281] io_submit_sqes+0xcce/0xf10
[ 30.637839] __x64_sys_io_uring_enter+0x3fb/0x5b0
[ 30.638465] do_syscall_64+0x2d/0x40
[ 30.638999] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 30.639643]
[ 30.639643] other info that might help us debug this:
[ 30.639643]
[ 30.640618] Possible unsafe locking scenario:
[ 30.640618]
[ 30.641402] CPU0 CPU1
[ 30.641938] ---- ----
[ 30.642664] lock(&ctx->uring_lock);
[ 30.643425] lock(&ep->mtx);
[ 30.644498] lock(&ctx->uring_lock);
[ 30.645668] lock(&ep->mtx);
[ 30.646321]
[ 30.646321] *** DEADLOCK ***
[ 30.646321]
[ 30.647642] 1 lock held by a.out/1128:
[ 30.648424] #0:
ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
[ 30.649954]
[ 30.649954] stack backtrace:
[ 30.650592] CPU: 1 PID: 1128 Comm: a.out Not tainted 5.11.0-rc3-next-
20210115 #1
[ 30.651554] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 30.652290] Call Trace:
[ 30.652688] dump_stack+0xac/0xe3
[ 30.653164] check_noncircular+0x11e/0x130
[ 30.653747] ? check_prevs_add+0x226/0xb00
[ 30.654303] check_prevs_add+0x226/0xb00
[ 30.654845] ? add_lock_to_list.constprop.49+0xac/0x1d0
[ 30.655564] __lock_acquire+0x1237/0x13a0
[ 30.656262] lock_acquire+0x2c7/0x390
[ 30.656788] ? __ep_eventpoll_poll+0x9f/0x220
[ 30.657379] ? __io_queue_proc.isra.88+0x180/0x180
[ 30.658014] __mutex_lock+0xae/0x9f0
[ 30.658524] ? __ep_eventpoll_poll+0x9f/0x220
[ 30.659112] ? mark_held_locks+0x5a/0x80
[ 30.659648] ? __ep_eventpoll_poll+0x9f/0x220
[ 30.660229] ? _raw_spin_unlock_irqrestore+0x2d/0x40
[ 30.660885] ? trace_hardirqs_on+0x46/0x110
[ 30.661471] ? __io_queue_proc.isra.88+0x180/0x180
[ 30.662102] ? __ep_eventpoll_poll+0x9f/0x220
[ 30.662696] __ep_eventpoll_poll+0x9f/0x220
[ 30.663273] ? __ep_eventpoll_poll+0x220/0x220
[ 30.663875] __io_arm_poll_handler+0xbf/0x220
[ 30.664463] io_issue_sqe+0xa6b/0x13e0
[ 30.664984] ? __lock_acquire+0x782/0x13a0
[ 30.665544] ? __io_queue_proc.isra.88+0x180/0x180
[ 30.666170] ? __io_queue_sqe+0x10b/0x550
[ 30.666725] __io_queue_sqe+0x10b/0x550
[ 30.667252] ? __fget_files+0x131/0x260
[ 30.667791] ? io_req_prep+0xd8/0x1090
[ 30.668316] ? io_queue_sqe+0x235/0x470
[ 30.668868] io_queue_sqe+0x235/0x470
[ 30.669398] io_submit_sqes+0xcce/0xf10
[ 30.669931] ? xa_load+0xe4/0x1c0
[ 30.670425] __x64_sys_io_uring_enter+0x3fb/0x5b0
[ 30.671051] ? lockdep_hardirqs_on_prepare+0xde/0x180
[ 30.671719] ? syscall_enter_from_user_mode+0x2b/0x80
[ 30.672380] do_syscall_64+0x2d/0x40
[ 30.672901] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 30.673503] RIP: 0033:0x7fd89c813239
[ 30.673962] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48
[ 30.675920] RSP: 002b:
00007ffc65a7c628 EFLAGS:
00000217 ORIG_RAX:
00000000000001aa
[ 30.676791] RAX:
ffffffffffffffda RBX:
0000000000000000 RCX:
00007fd89c813239
[ 30.677594] RDX:
0000000000000000 RSI:
0000000000000014 RDI:
0000000000000003
[ 30.678678] RBP:
00007ffc65a7c720 R08:
0000000000000000 R09:
0000000003000000
[ 30.679492] R10:
0000000000000000 R11:
0000000000000217 R12:
0000000000400ff0
[ 30.680282] R13:
00007ffc65a7c840 R14:
0000000000000000 R15:
0000000000000000
This might happen if we do epoll_wait on a uring fd while reading/writing
the former epoll fd in a sqe in the former uring instance.
So let's don't flush cqring overflow list, just do a simple check.
Reported-by: Abaci <abaci@linux.alibaba.com>
Fixes:
6c503150ae33 ("io_uring: patch up IOPOLL overflow_flush sync")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 10 Feb 2021 00:03:23 +0000 (00:03 +0000)]
io_uring: defer flushing cached reqs
Awhile there are requests in the allocation cache -- use them, only if
those ended go for the stashed memory in comp.free_list. As list
manipulation are generally heavy and are not good for caches, flush them
all or as much as can in one go.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: return success/failure from io_flush_cached_reqs()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 10 Feb 2021 00:03:22 +0000 (00:03 +0000)]
io_uring: take comp_state from ctx
__io_queue_sqe() is always called with a non-NULL comp_state, which is
taken directly from context. Don't pass it around but infer from ctx.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 10 Feb 2021 00:03:21 +0000 (00:03 +0000)]
io_uring: enable req cache for task_work items
task_work is run without utilizing the req alloc cache, so any deferred
items don't get to take advantage of either the alloc or free side of it.
With task_work now being wrapped by io_uring, we can use the ctx
completion state to both use the req cache and the completion flush
batching.
With this, the only request type that cannot take advantage of the req
cache is IRQ driven IO for regular files / block devices. Anything else,
including IOPOLL polled IO to those same tyes, will take advantage of it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 10 Feb 2021 00:03:20 +0000 (00:03 +0000)]
io_uring: provide FIFO ordering for task_work
task_work is a LIFO list, due to how it's implemented as a lockless
list. For long chains of task_work, this can be problematic as the
first entry added is the last one processed. Similarly, we'd waste
a lot of CPU cycles reversing this list.
Wrap the task_work so we have a single task_work entry per task per
ctx, and use that to run it in the right order.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 10 Feb 2021 00:03:19 +0000 (00:03 +0000)]
io_uring: use persistent request cache
Now that we have the submit_state in the ring itself, we can have io_kiocb
allocations that are persistent across invocations. This reduces the time
spent doing slab allocations and frees.
[sil: rebased]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 10 Feb 2021 00:03:18 +0000 (00:03 +0000)]
io_uring: feed reqs back into alloc cache
Make io_req_free_batch(), which is used for inline executed requests and
IOPOLL, to return requests back into the allocation cache, so avoid
most of kmalloc()/kfree() for those cases.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 10 Feb 2021 00:03:17 +0000 (00:03 +0000)]
io_uring: persistent req cache
Don't free batch-allocated requests across syscalls.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 10 Feb 2021 00:03:16 +0000 (00:03 +0000)]
io_uring: count ctx refs separately from reqs
Currently batch free handles request memory freeing and ctx ref putting
together. Separate them and use different counters, that will be needed
for reusing reqs memory.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 10 Feb 2021 00:03:15 +0000 (00:03 +0000)]
io_uring: remove fallback_req
Remove fallback_req for now, it gets in the way of other changes.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 10 Feb 2021 00:03:14 +0000 (00:03 +0000)]
io_uring: submit-completion free batching
io_submit_flush_completions() does completion batching, but may also use
free batching as iopoll does. The main beneficiaries should be buffered
reads/writes and send/recv.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 10 Feb 2021 00:03:13 +0000 (00:03 +0000)]
io_uring: replace list with array for compl batch
Reincarnation of an old patch that replaces a list in struct
io_compl_batch with an array. It's needed to avoid hooking requests via
their compl.list, because it won't be always available in the future.
It's also nice to split io_submit_flush_completions() to avoid free
under locks and remove unlock/lock with a long comment describing when
it can be done.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 10 Feb 2021 00:03:12 +0000 (00:03 +0000)]
io_uring: don't reinit submit state every time
As now submit_state is retained across syscalls, we can save ourself
from initialising it from ground up for each io_submit_sqes(). Set some
fields during ctx allocation, and just keep them always consistent.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: remove unnecessary zeroing of ctx members]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 10 Feb 2021 00:03:11 +0000 (00:03 +0000)]
io_uring: remove ctx from comp_state
completion state is closely bound to ctx, we don't need to store ctx
inside as we always have it around to pass to flush.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 10 Feb 2021 00:03:10 +0000 (00:03 +0000)]
io_uring: don't keep submit_state on stack
struct io_submit_state is quite big (168 bytes) and going to grow. It's
better to not keep it on stack as it is now. Move it to context, it's
always protected by uring_lock, so it's fine to have only one instance
of it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 10 Feb 2021 00:03:09 +0000 (00:03 +0000)]
io_uring: don't propagate io_comp_state
There is no reason to drag io_comp_state into opcode handlers, we just
need a flag and the actual work will be done in __io_queue_sqe().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 10 Feb 2021 00:03:08 +0000 (00:03 +0000)]
io_uring: make op handlers always take issue flags
Make opcode handler interfaces a bit more consistent by always passing
in issue flags. Bulky but pretty easy and mechanical change.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 10 Feb 2021 00:03:07 +0000 (00:03 +0000)]
io_uring: replace force_nonblock with flags
Replace bool force_nonblock with flags. It has a long standing goal of
differentiating context from which we execute. Currently we have some
subtle places where some invariants, like holding of uring_lock, are
subtly inferred.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sun, 7 Feb 2021 22:34:26 +0000 (22:34 +0000)]
io_uring: cleanup up cancel SQPOLL reqs across exec
For SQPOLL rings tctx_inflight() always returns zero, so it might skip
doing full cancelation. It's fine because we jam all sqpoll submissions
in any case and do go through files cancel for them, but not nice.
Do the intended full cancellation, by mimicking __io_uring_task_cancel()
waiting but impersonating SQPOLL task.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 5 Feb 2021 00:58:00 +0000 (00:58 +0000)]
io_uring: refactor sendmsg/recvmsg iov managing
Current iov handling with recvmsg/sendmsg may be confusing. First make a
rule for msg->iov: either it points to an allocated iov that have to be
kfree()'d later, or it's NULL and we use fast_iov. That's much better
than current 3-state (also can point to fast_iov). And rename it into
free_iov for uniformity with read/write.
Also, instead of after struct io_async_msghdr copy fixing up of
msg.msg_iter.iov has been happening in io_recvmsg()/io_sendmsg(). Move
it into io_setup_async_msg(), that's the right place.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: add comment on NULL check before kfree()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 5 Feb 2021 00:57:59 +0000 (00:57 +0000)]
io_uring: clean iov usage for recvmsg buf select
Don't pretend we don't know that REQ_F_BUFFER_SELECT for recvmsg always
uses fast_iov -- clean up confusing intermixing kmsg->iov and
kmsg->fast_iov for buffer select.
Also don't init iter with garbage in __io_recvmsg_copy_hdr() only for it
to be set shortly after in io_recvmsg().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 5 Feb 2021 00:57:58 +0000 (00:57 +0000)]
io_uring: set msg_name on msg fixup
io_setup_async_msg() should fully prepare io_async_msghdr, let it also
handle assigning msg_name and don't hand code it in [send,recv]msg().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 4 Feb 2021 13:52:08 +0000 (13:52 +0000)]
io_uring/io-wq: return 2-step work swap scheme
Saving one lock/unlock for io-wq is not super important, but adds some
ugliness in the code. More important, atomic decs not turning it to zero
for some archs won't give the right ordering/barriers so the
io_steal_work() may pretty easily get subtly and completely broken.
Return back 2-step io-wq work exchange and clean it up.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 4 Feb 2021 13:52:07 +0000 (13:52 +0000)]
io_uring: deduplicate file table slot calculation
Extract a helper io_fixed_file_slot() returning a place in our fixed
files table, so we don't hand-code it three times in the code.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 4 Feb 2021 13:52:06 +0000 (13:52 +0000)]
io_uring: io_import_iovec return type cleanup
io_import_iovec() doesn't return IO size anymore, only error code. Make
it more apparent by returning int instead of ssize and clean up
leftovers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 4 Feb 2021 13:52:05 +0000 (13:52 +0000)]
io_uring: treat NONBLOCK and RWF_NOWAIT similarly
Make decision making of whether we need to retry read/write similar for
O_NONBLOCK and RWF_NOWAIT. Set REQ_F_NOWAIT when either is specified and
use it for all relevant checks. Also fix resubmitting NOWAIT requests
via io_rw_reissue().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 4 Feb 2021 13:52:04 +0000 (13:52 +0000)]
io_uring: highlight read-retry loop
We already have implicit do-while for read-retries but with goto in the
end. Convert it to an actual do-while, it highlights it so making a
bit more understandable and is cleaner in general.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 4 Feb 2021 13:52:03 +0000 (13:52 +0000)]
io_uring: inline io_read()'s iovec freeing
io_read() has not the simpliest control flow with a lot of jumps and
it's hard to read. One of those is a out_free: label, which frees iovec.
However, from the middle of io_read() iovec is NULL'ed and so
kfree(iovec) is no-op, it leaves us with two place where we can inline
it and further clean up the code.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 4 Feb 2021 13:52:02 +0000 (13:52 +0000)]
io_uring: don't forget to adjust io_size
We have invariant in io_read() of how much we're trying to read spilled
into an iter and io_size variable. The last one controls decision making
about whether to do read-retries. However, io_size is modified only
after the first read attempt, so if we happen to go for a third retry in
a single call to io_read(), we will get io_size greater than in the
iterator, so may lead to various side effects up to live-locking.
Modify io_size each time.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 4 Feb 2021 13:52:01 +0000 (13:52 +0000)]
io_uring: let io_setup_async_rw take care of iovec
Now we give out ownership of iovec into io_setup_async_rw(), so it
either sets request's context right or frees the iovec on error itself.
Makes our life a bit easier at call sites.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 4 Feb 2021 13:52:00 +0000 (13:52 +0000)]
io_uring: further simplify do_read error parsing
First, instead of checking iov_iter_count(iter) for 0 to find out that
all needed bytes were read, just compare returned code against io_size.
It's more reliable and arguably cleaner.
Also, place the half-read case into an else branch and delete an extra
label.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 4 Feb 2021 13:51:59 +0000 (13:51 +0000)]
io_uring: refactor io_read for unsupported nowait
!io_file_supports_async() case of io_read() is hard to read, it jumps
somewhere in the middle of the function just to do async setup and fail
on a similar check. Call io_setup_async_rw() directly for this case,
it's much easier to follow.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 4 Feb 2021 13:51:58 +0000 (13:51 +0000)]
io_uring: refactor io_cqring_wait
It's easy to make a mistake in io_cqring_wait() because for all
break/continue clauses we need to watch for prepare/finish_wait to be
used correctly. Extract all those into a new helper
io_cqring_wait_schedule(), and transforming the loop into simple series
of func calls: prepare(); check_and_schedule(); finish();
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 4 Feb 2021 13:51:57 +0000 (13:51 +0000)]
io_uring: refactor scheduling in io_cqring_wait
schedule_timeout() with timeout=MAX_SCHEDULE_TIMEOUT is guaranteed to
work just as schedule(), so instead of hand-coding it based on arguments
always use the timeout version and simplify code.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 4 Feb 2021 13:51:56 +0000 (13:51 +0000)]
io_uring: deduplicate core cancellations sequence
Files and task cancellations go over same steps trying to cancel
requests in io-wq, poll, etc. Deduplicate it with a helper.
note: new io_uring_try_cancel_requests() is former
__io_uring_cancel_task_requests() with files passed as an agrument and
flushing overflowed requests.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 1 Feb 2021 18:59:56 +0000 (18:59 +0000)]
io_uring: simplify do_read return parsing
do_read() returning 0 bytes read (not -EAGAIN/etc.) is not an important
enough of a case to prioritise it. Fold it into ret < 0 check, so we get
rid of an extra if and make it a bit more readable.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 1 Feb 2021 18:59:55 +0000 (18:59 +0000)]
io_uring: deduplicate adding to REQ_F_INFLIGHT
We don't know for how long REQ_F_INFLIGHT is going to stay, cleaner to
extract a helper for marking requests as so.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 1 Feb 2021 18:59:54 +0000 (18:59 +0000)]
io_uring: remove work flags after cleanup
Shouldn't be a problem now, but it's better to clean
REQ_F_WORK_INITIALIZED and work->flags only after relevant resources are
killed, so cancellation see them.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 1 Feb 2021 18:59:53 +0000 (18:59 +0000)]
io_uring: inline io_req_drop_files()
req->files now have same lifetime as all other iowq-work resources,
inline io_req_drop_files() for consistency. Moreover, since
REQ_F_INFLIGHT is no more files specific, the function name became
very confusing.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 1 Feb 2021 18:59:52 +0000 (18:59 +0000)]
io_uring: kill not used needs_file_no_error
We have no request types left using needs_file_no_error, remove it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 1 Feb 2021 18:59:51 +0000 (18:59 +0000)]
io_uring: fix inconsistent lock state
WARNING: inconsistent lock state
inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
syz-executor217/8450 [HC1[1]:SC0[0]:HE0:SE1] takes:
ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline]
ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_req_clean_work fs/io_uring.c:1398 [inline]
ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&fs->lock);
<Interrupt>
lock(&fs->lock);
*** DEADLOCK ***
1 lock held by syz-executor217/8450:
#0:
ffff88802417c3e8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0x1071/0x1f30 fs/io_uring.c:9442
stack backtrace:
CPU: 1 PID: 8450 Comm: syz-executor217 Not tainted 5.11.0-rc5-next-
20210129-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<IRQ>
[...]
_raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
spin_lock include/linux/spinlock.h:354 [inline]
io_req_clean_work fs/io_uring.c:1398 [inline]
io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029
__io_free_req+0x3d/0x2e0 fs/io_uring.c:2046
io_free_req fs/io_uring.c:2269 [inline]
io_double_put_req fs/io_uring.c:2392 [inline]
io_put_req+0xf9/0x570 fs/io_uring.c:2388
io_link_timeout_fn+0x30c/0x480 fs/io_uring.c:6497
__run_hrtimer kernel/time/hrtimer.c:1519 [inline]
__hrtimer_run_queues+0x609/0xe40 kernel/time/hrtimer.c:1583
hrtimer_interrupt+0x334/0x940 kernel/time/hrtimer.c:1645
local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1085 [inline]
__sysvec_apic_timer_interrupt+0x146/0x540 arch/x86/kernel/apic/apic.c:1102
asm_call_irq_on_stack+0xf/0x20
</IRQ>
__run_sysvec_on_irqstack arch/x86/include/asm/irq_stack.h:37 [inline]
run_sysvec_on_irqstack_cond arch/x86/include/asm/irq_stack.h:89 [inline]
sysvec_apic_timer_interrupt+0xbd/0x100 arch/x86/kernel/apic/apic.c:1096
asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:629
RIP: 0010:__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:169 [inline]
RIP: 0010:_raw_spin_unlock_irq+0x25/0x40 kernel/locking/spinlock.c:199
spin_unlock_irq include/linux/spinlock.h:404 [inline]
io_queue_linked_timeout+0x194/0x1f0 fs/io_uring.c:6525
__io_queue_sqe+0x328/0x1290 fs/io_uring.c:6594
io_queue_sqe+0x631/0x10d0 fs/io_uring.c:6639
io_queue_link_head fs/io_uring.c:6650 [inline]
io_submit_sqe fs/io_uring.c:6697 [inline]
io_submit_sqes+0x19b5/0x2720 fs/io_uring.c:6960
__do_sys_io_uring_enter+0x107d/0x1f30 fs/io_uring.c:9443
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Don't free requests from under hrtimer context (softirq) as it may sleep
or take spinlocks improperly (e.g. non-irq versions).
Cc: stable@vger.kernel.org # 5.6+
Reported-by: syzbot+81d17233a2b02eafba33@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dan Carpenter [Mon, 1 Feb 2021 12:23:42 +0000 (15:23 +0300)]
io_uring: Fix NULL dereference in error in io_sqe_files_register()
If we hit a "goto out_free;" before the "ctx->file_data" pointer has
been assigned then it leads to a NULL derefence when we call:
free_fixed_rsrc_data(ctx->file_data);
We can fix this by moving the assignment earlier.
Fixes:
1ad555c6ae6e ("io_uring: create common fixed_rsrc_data allocation routines")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hao Xu [Sun, 31 Jan 2021 14:39:04 +0000 (22:39 +0800)]
io_uring: check kthread parked flag before sqthread goes to sleep
Abaci reported this issue:
#[ 605.170872] INFO: task kworker/u4:1:53 blocked for more than 143 seconds.
[ 605.172123] Not tainted 5.10.0+ #1
[ 605.172811] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 605.173915] task:kworker/u4:1 state:D stack: 0 pid: 53 ppid: 2 flags:0x00004000
[ 605.175130] Workqueue: events_unbound io_ring_exit_work
[ 605.175931] Call Trace:
[ 605.176334] __schedule+0xe0e/0x25a0
[ 605.176971] ? firmware_map_remove+0x1a1/0x1a1
[ 605.177631] ? write_comp_data+0x2a/0x80
[ 605.178272] schedule+0xd0/0x270
[ 605.178811] schedule_timeout+0x6b6/0x940
[ 605.179415] ? mark_lock.part.0+0xca/0x1420
[ 605.180062] ? usleep_range+0x170/0x170
[ 605.180684] ? wait_for_completion+0x16d/0x280
[ 605.181392] ? mark_held_locks+0x9e/0xe0
[ 605.182079] ? rwlock_bug.part.0+0x90/0x90
[ 605.182853] ? lockdep_hardirqs_on_prepare+0x286/0x400
[ 605.183817] wait_for_completion+0x175/0x280
[ 605.184713] ? wait_for_completion_interruptible+0x340/0x340
[ 605.185611] ? _raw_spin_unlock_irq+0x24/0x30
[ 605.186307] ? migrate_swap_stop+0x9c0/0x9c0
[ 605.187046] kthread_park+0x127/0x1c0
[ 605.187738] io_sq_thread_stop+0xd5/0x530
[ 605.188459] io_ring_exit_work+0xb1/0x970
[ 605.189207] process_one_work+0x92c/0x1510
[ 605.189947] ? pwq_dec_nr_in_flight+0x360/0x360
[ 605.190682] ? rwlock_bug.part.0+0x90/0x90
[ 605.191430] ? write_comp_data+0x2a/0x80
[ 605.192207] worker_thread+0x9b/0xe20
[ 605.192900] ? process_one_work+0x1510/0x1510
[ 605.193599] kthread+0x353/0x460
[ 605.194154] ? _raw_spin_unlock_irq+0x24/0x30
[ 605.194910] ? kthread_create_on_node+0x100/0x100
[ 605.195821] ret_from_fork+0x1f/0x30
[ 605.196605]
[ 605.196605] Showing all locks held in the system:
[ 605.197598] 1 lock held by khungtaskd/25:
[ 605.198301] #0:
ffffffff8b5f76a0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.0+0x0/0x30
[ 605.199914] 3 locks held by kworker/u4:1/53:
[ 605.200609] #0:
ffff888100109938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x82a/0x1510
[ 605.202108] #1:
ffff888100e47dc0 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work+0x85e/0x1510
[ 605.203681] #2:
ffff888116931870 (&sqd->lock){+.+.}-{3:3}, at: io_sq_thread_park.part.0+0x19/0x50
[ 605.205183] 3 locks held by systemd-journal/161:
[ 605.206037] 1 lock held by syslog-ng/254:
[ 605.206674] 2 locks held by agetty/311:
[ 605.207292] #0:
ffff888101097098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x27/0x80
[ 605.208715] #1:
ffffc900000332e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x222/0x1bb0
[ 605.210131] 2 locks held by bash/677:
[ 605.210723] #0:
ffff88810419a098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x27/0x80
[ 605.212105] #1:
ffffc900000512e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x222/0x1bb0
[ 605.213777]
[ 605.214151] =============================================
I believe this is caused by the follow race:
(ctx_list is empty now)
=> io_put_sq_data |
==> kthread_park(sqd->thread); |
====> set KTHREAD_SHOULD_PARK |
====> wake_up_process(k) | sq thread is running
|
|
| needs_sched is true since no ctx,
| so TASK_INTERRUPTIBLE set and schedule
| out then never wake up again
|
====> wait_for_completion |
(stuck here)
So check if sqthread gets park flag right before schedule().
since ctx_list is always empty when this problem happens, here I put
kthread_should_park() before setting the wakeup flag(ctx_list is empty
so this for loop is fast), where is close enough to schedule(). The
problem doesn't show again in my repro testing after this fix.
Reported-by: Abaci <abaci@linux.alibaba.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 27 Jan 2021 01:25:04 +0000 (01:25 +0000)]
MAINTAINERS: update io_uring section
Add the missing kernel io_uring header, add Pavel as a reviewer, and
exclude io_uring from the FILESYSTEMS section to avoid keep spamming Al
(mainly) with bug reports, patches, etc.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
noah [Tue, 26 Jan 2021 20:23:28 +0000 (15:23 -0500)]
io_uring: Add skip option for __io_sqe_files_update
This patch adds support for skipping a file descriptor when using
IORING_REGISTER_FILES_UPDATE. __io_sqe_files_update will skip fds set
to IORING_REGISTER_FILES_SKIP. IORING_REGISTER_FILES_SKIP is inturn
added as a #define in io_uring.h
Signed-off-by: noah <goldstein.w.n@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 26 Jan 2021 13:51:09 +0000 (13:51 +0000)]
io_uring: cleanup files_update looping
Replace a while with a simple for loop, that looks way more natural, and
enables us to use "continue" as indexes are no more updated by hand in
the end of the loop.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 25 Jan 2021 11:42:21 +0000 (11:42 +0000)]
io_uring: consolidate putting reqs task
We grab a task for each request and while putting it it also have to do
extra work like inflight accounting and waking up that task. This
sequence is duplicated several time, it's good time to add a helper.
More to that, the helper generates better code due to better locality
and so not failing alias analysis.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 25 Jan 2021 11:42:20 +0000 (11:42 +0000)]
io_uring: ensure only sqo_task has file notes
For SQPOLL io_uring we want to have only one file note held by
sqo_task. Add a warning to make sure it holds. It's deep in
io_uring_add_task_file() out of hot path, so shouldn't hurt.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yejune Deng [Thu, 24 Dec 2020 03:02:20 +0000 (11:02 +0800)]
io_uring: simplify io_remove_personalities()
The function io_remove_personalities() is very similar to
io_unregister_personality(),so implement io_remove_personalities()
calling io_unregister_personality().
Signed-off-by: Yejune Deng <yejune.deng@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 19 Jan 2021 22:53:54 +0000 (15:53 -0700)]
io_uring/io-wq: kill off now unused IO_WQ_WORK_NO_CANCEL
It's no longer used as IORING_OP_CLOSE got rid for the need of flagging
it as uncancelable, kill it of.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 19 Jan 2021 22:50:37 +0000 (15:50 -0700)]
io_uring: get rid of intermediate IORING_OP_CLOSE stage
We currently split the close into two, in case we have a ->flush op
that we can't safely handle from non-blocking context. This requires
us to flag the op as uncancelable if we do need to punt it async, and
that means special handling for just this op type.
Use __close_fd_get_file() and grab the files lock so we can get the file
and check if we need to go async in one atomic operation. That gets rid
of the need for splitting this into two steps, and hence the need for
IO_WQ_WORK_NO_CANCEL.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 19 Jan 2021 22:41:52 +0000 (15:41 -0700)]
fs: provide locked helper variant of close_fd_get_file()
Assumes current->files->file_lock is already held on invocation. Helps
the caller check the file before removing the fd, if it needs to.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 19 Jan 2021 13:32:47 +0000 (13:32 +0000)]
io_uring: save atomic dec for inline executed reqs
When a request is completed with comp_state, its completion reference
put is deferred to io_submit_flush_completions(), but the submission
is put not far from there, so do it together to save one atomic dec per
request. That targets requests that complete inline, e.g. buffered rw,
send/recv.
Proper benchmarking haven't been conducted but for nops(batch=32) it was
around 7901 vs 8117 KIOPS (~2.7%), or ~4% per perf profiling.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 19 Jan 2021 13:32:46 +0000 (13:32 +0000)]
io_uring: don't flush CQEs deep down the stack
io_submit_flush_completions() is called down the stack in the _state
version of io_req_complete(), that's ok because is only called by
io_uring opcode handler functions directly. Move it up to
__io_queue_sqe() as preparation.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 19 Jan 2021 13:32:45 +0000 (13:32 +0000)]
io_uring: help inlining of io_req_complete()
__io_req_complete() inlining is a bit weird, some compilers don't
optimise out the non-NULL branch of it even when called as
io_req_complete(). Help it a bit by extracting state and stateless
helpers out of __io_req_complete().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 19 Jan 2021 13:32:44 +0000 (13:32 +0000)]
io_uring: add a helper timeout mode calculation
Deduplicates translation of timeout flags into hrtimer_mode.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 19 Jan 2021 13:32:42 +0000 (13:32 +0000)]
io_uring: deduplicate failing task_work_add
When io_req_task_work_add() fails, the request will be cancelled by
enqueueing via task_works of io-wq. Extract a function for that.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 19 Jan 2021 13:32:41 +0000 (13:32 +0000)]
io_uring: remove __io_state_file_put
The check in io_state_file_put() is optimised pretty well when called
from __io_file_get(). Don't pollute the code with all these variants.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 19 Jan 2021 13:32:40 +0000 (13:32 +0000)]
io_uring: simplify io_alloc_req()
Get rid of a label in io_alloc_req(), it's cleaner to do return
directly.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 19 Jan 2021 13:32:39 +0000 (13:32 +0000)]
io_uring: further deduplicate #CQ events calc
Apparently, there is one more place hand coded calculation of number of
CQ events in the ring. Use __io_cqring_events() helper in
io_get_cqring() as well. Naturally, assembly stays identical.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 19 Jan 2021 13:32:38 +0000 (13:32 +0000)]
io_uring: inline __io_commit_cqring()
Inline it in its only user, that's cleaner
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 19 Jan 2021 13:32:37 +0000 (13:32 +0000)]
io_uring: inline io_async_submit()
The name is confusing and it's used only in one place.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 19 Jan 2021 13:32:36 +0000 (13:32 +0000)]
io_uring: cleanup personalities under uring_lock
personality_idr is usually synchronised by uring_lock, the exception
would be removing personalities in io_ring_ctx_wait_and_kill(), which
is legit as refs are killed by that point but still would be more
resilient to do it under the lock.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 19 Jan 2021 13:32:35 +0000 (13:32 +0000)]
io_uring: refactor io_resubmit_prep()
It's awkward to pass return a value into a function for it to return it
back. Check it at the caller site and clean up io_resubmit_prep() a bit.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 19 Jan 2021 13:32:34 +0000 (13:32 +0000)]
io_uring: optimise io_rw_reissue()
The hot path is IO completing on the first try. Reshuffle io_rw_reissue() so
it's checked first.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bijan Mottahedeh [Fri, 15 Jan 2021 17:37:52 +0000 (17:37 +0000)]
io_uring: make percpu_ref_release names consistent
Make the percpu ref release function names consistent between rsrc data
and nodes.
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bijan Mottahedeh [Fri, 15 Jan 2021 17:37:51 +0000 (17:37 +0000)]
io_uring: create common fixed_rsrc_data allocation routines
Create common alloc/free fixed_rsrc_data routines for both files and
buffers.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
[remove buffer part]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bijan Mottahedeh [Fri, 15 Jan 2021 17:37:50 +0000 (17:37 +0000)]
io_uring: create common fixed_rsrc_ref_node handling routines
Create common routines to be used for both files/buffers registration.
[remove io_sqe_rsrc_set_node substitution]
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
[merge, quiesce only for files]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 15 Jan 2021 17:37:49 +0000 (17:37 +0000)]
io_uring: split ref_node alloc and init
A simple prep patch allowing to set refnode callbacks after it was
allocated. This needed to 1) keep ourself off of hi-level functions
where it's not pretty and they are not necessary 2) amortise ref_node
allocation in the future, e.g. for updates.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bijan Mottahedeh [Fri, 15 Jan 2021 17:37:48 +0000 (17:37 +0000)]
io_uring: split alloc_fixed_file_ref_node
Split alloc_fixed_file_ref_node into resource generic/specific parts,
to be leveraged for fixed buffers.
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bijan Mottahedeh [Fri, 15 Jan 2021 17:37:47 +0000 (17:37 +0000)]
io_uring: add rsrc_ref locking routines
Encapsulate resource reference locking into separate routines.
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bijan Mottahedeh [Fri, 15 Jan 2021 17:37:46 +0000 (17:37 +0000)]
io_uring: separate ref_list from fixed_rsrc_data
Uplevel ref_list and make it common to all resources. This is to
allow one common ref_list to be used for both files, and buffers
in upcoming patches.
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bijan Mottahedeh [Fri, 15 Jan 2021 17:37:45 +0000 (17:37 +0000)]
io_uring: generalize io_queue_rsrc_removal
Generalize io_queue_rsrc_removal to handle both files and buffers.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
[remove io_mapped_ubuf from rsrc tables/etc. for now]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bijan Mottahedeh [Fri, 15 Jan 2021 17:37:44 +0000 (17:37 +0000)]
io_uring: rename file related variables to rsrc
This is a prep rename patch for subsequent patches to generalize file
registration.
[io_uring_rsrc_update:: rename fds -> data]
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
[leave io_uring_files_update as struct]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bijan Mottahedeh [Wed, 6 Jan 2021 20:39:11 +0000 (12:39 -0800)]
io_uring: modularize io_sqe_buffers_register
Move allocation of buffer management structures, and validation of
buffers into separate routines.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bijan Mottahedeh [Wed, 6 Jan 2021 20:39:10 +0000 (12:39 -0800)]
io_uring: modularize io_sqe_buffer_register
Split io_sqe_buffer_register into two routines:
- io_sqe_buffer_register() registers a single buffer
- io_sqe_buffers_register iterates over all user specified buffers
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 10 Dec 2020 19:25:36 +0000 (12:25 -0700)]
io_uring: enable LOOKUP_CACHED path resolution for filename lookups
Instead of being pessimistic and assume that path lookup will block, use
LOOKUP_CACHED to attempt just a cached lookup. This ensures that the
fast path is always done inline, and we only punt to async context if
IO is needed to satisfy the lookup.
For forced nonblock open attempts, mark the file O_NONBLOCK over the
actual ->open() call as well. We can safely clear this again before
doing fd_install(), so it'll never be user visible that we fiddled with
it.
This greatly improves the performance of file open where the dentry is
already cached:
ached 5.10-git 5.10-git+LOOKUP_CACHED Speedup
---------------------------------------------------------------
33% 1,014,975 900,474 1.1x
89% 545,466 292,937 1.9x
100% 435,636 151,475 2.9x
The more cache hot we are, the faster the inline LOOKUP_CACHED
optimization helps. This is unsurprising and expected, as a thread
offload becomes a more dominant part of the total overhead. If we look
at io_uring tracing, doing an IORING_OP_OPENAT on a file that isn't in
the dentry cache will yield:
275.550481: io_uring_create: ring
00000000ddda6278, fd 3 sq size 8, cq size 16, flags 0
275.550491: io_uring_submit_sqe: ring
00000000ddda6278, op 18, data 0x0, non block 1, sq_thread 0
275.550498: io_uring_queue_async_work: ring
00000000ddda6278, request
00000000c0267d17, flags 69760, normal queue, work
000000003d683991
275.550502: io_uring_cqring_wait: ring
00000000ddda6278, min_events 1
275.550556: io_uring_complete: ring
00000000ddda6278, user_data 0x0, result 4
which shows a failed nonblock lookup, then punt to worker, and then we
complete with fd == 4. This takes 65 usec in total. Re-running the same
test case again:
281.253956: io_uring_create: ring
0000000008207252, fd 3 sq size 8, cq size 16, flags 0
281.253967: io_uring_submit_sqe: ring
0000000008207252, op 18, data 0x0, non block 1, sq_thread 0
281.253973: io_uring_complete: ring
0000000008207252, user_data 0x0, result 4
shows the same request completing inline, also returning fd == 4. This
takes 6 usec.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 1 Feb 2021 17:02:28 +0000 (10:02 -0700)]
Merge branch 'work.namei' of git://git./linux/kernel/git/viro/vfs into for-5.12/io_uring
Merge RESOLVE_CACHED bits from Al, as the io_uring changes will build on
top of that.
* 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: expose LOOKUP_CACHED through openat2() RESOLVE_CACHED
fs: add support for LOOKUP_CACHED
saner calling conventions for unlazy_child()
fs: make unlazy_walk() error handling consistent
fs/namei.c: Remove unlikely of status being -ECHILD in lookup_fast()
do_tmpfile(): don't mess with finish_open()
Linus Torvalds [Sun, 31 Jan 2021 21:50:09 +0000 (13:50 -0800)]
Linux 5.11-rc6
Linus Torvalds [Sun, 31 Jan 2021 19:57:37 +0000 (11:57 -0800)]
Merge tag 'efi-urgent-for-v5.11' of git://git./linux/kernel/git/tip/tip
Pull EFI fix from Borislav Petkov:
"A single fix from Lukas: handle boolean device properties imported
from Apple firmware correctly"
* tag 'efi-urgent-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi/apple-properties: Reinstate support for boolean properties
Linus Torvalds [Sun, 31 Jan 2021 19:48:12 +0000 (11:48 -0800)]
Merge tag 'x86_entry_for_v5.11_rc6' of git://git./linux/kernel/git/tip/tip
Pull x86 fix from Borislav Petkov:
"A single fix for objtool to generate proper unwind info for newer
toolchains which do not generate section symbols anymore. And a
cleanup ontop.
This was originally going to go during the next merge window but
people can already trigger a build error with binutils-2.36 which
doesn't emit section symbols - something which objtool relies on - so
let's expedite it"
* tag 'x86_entry_for_v5.11_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry: Remove put_ret_addr_in_rdi THUNK macro argument
x86/entry: Emit a symbol for register restoring thunk
Linus Torvalds [Sun, 31 Jan 2021 19:40:57 +0000 (11:40 -0800)]
Merge tag 'timers-urgent-2021-01-31' of git://git./linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
"A fix for handling advertised, but non-existent 146818 RTCs correctly.
With the recent UIP handling changes the time readout of non-existent
RTCs hangs forever as the read returns always 0xFF which means the UIP
bit is set.
Sanity check the RTC before registering by checking the RTC_VALID
register for correctness"
* tag 'timers-urgent-2021-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rtc: mc146818: Detect and handle broken RTCs
Linus Torvalds [Sun, 31 Jan 2021 19:39:32 +0000 (11:39 -0800)]
Merge tag 'core-urgent-2021-01-31' of git://git./linux/kernel/git/tip/tip
Pull single stepping fix from Thomas Gleixner:
"A single fix for the single step reporting regression caused by
getting the condition wrong when moving SYSCALL_EMU away from TIF
flags"
[ There's apparently another problem too, fix pending ]
* tag 'core-urgent-2021-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
entry: Unbreak single step reporting behaviour