platform/kernel/linux-rpi.git
3 years agoio_uring: move io_init_req()'s definition
Pavel Begunkov [Thu, 18 Feb 2021 18:29:40 +0000 (18:29 +0000)]
io_uring: move io_init_req()'s definition

A preparation patch, symbol to symbol move io_init_req() +
io_check_restriction() a bit up. The submission path is pretty settled
down, so don't worry about backports and move the functions instead of
relying on forward declarations in the future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: don't duplicate ->file check in sfr
Pavel Begunkov [Thu, 18 Feb 2021 18:29:39 +0000 (18:29 +0000)]
io_uring: don't duplicate ->file check in sfr

IORING_OP_SYNC_FILE_RANGE is marked as .needs_file, so the common path
will take care of assigning and validating req->file, no need to
duplicate it in io_sfr_prep().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: keep io_*_prep() naming consistent
Pavel Begunkov [Thu, 18 Feb 2021 18:29:38 +0000 (18:29 +0000)]
io_uring: keep io_*_prep() naming consistent

Follow io_*_prep() naming pattern, there are only fsync and sfr that
don't do that.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: kill fictitious submit iteration index
Pavel Begunkov [Thu, 18 Feb 2021 18:29:37 +0000 (18:29 +0000)]
io_uring: kill fictitious submit iteration index

@i and @submitted are very much coupled together, and there is no need
to keep them both. Remove @i, it doesn't change generated binary but
helps to keep a single source of truth.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: fix read memory leak
Pavel Begunkov [Wed, 17 Feb 2021 21:02:36 +0000 (21:02 +0000)]
io_uring: fix read memory leak

Don't forget to free iovec read inline completion and bunch of other
cases that do "goto done" before setting up an async context.

Fixes: 5ea5dd45844d ("io_uring: inline io_read()'s iovec freeing")
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: tctx->task_lock should be IRQ safe
Jens Axboe [Tue, 16 Feb 2021 17:33:53 +0000 (10:33 -0700)]
io_uring: tctx->task_lock should be IRQ safe

We add task_work from any context, hence we need to ensure that we can
tolerate it being from IRQ context as well.

Fixes: 7cbf1722d5fc ("io_uring: provide FIFO ordering for task_work")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoproc: don't allow async path resolution of /proc/thread-self components
Jens Axboe [Sun, 14 Feb 2021 20:21:43 +0000 (13:21 -0700)]
proc: don't allow async path resolution of /proc/thread-self components

If this is attempted by an io-wq kthread, then return -EOPNOTSUPP as we
don't currently support that. Once we can get task_pid_ptr() doing the
right thing, then this can go away again.

Use PF_IO_WORKER for this to speciically target the io_uring workers.
Modify the /proc/self/ check to use PF_IO_WORKER as well.

Cc: stable@vger.kernel.org
Fixes: 8d4c3e76e3be ("proc: don't allow async path resolution of /proc/self components")
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: kill cached requests from exiting task closing the ring
Jens Axboe [Sat, 13 Feb 2021 16:11:04 +0000 (09:11 -0700)]
io_uring: kill cached requests from exiting task closing the ring

Be nice and prune these upfront, in case the ring is being shared and
one of the tasks is going away. This is a bit more important now that
we account the allocations.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: add helper to free all request caches
Jens Axboe [Sat, 13 Feb 2021 16:09:44 +0000 (09:09 -0700)]
io_uring: add helper to free all request caches

We have three different ones, put it in a helper for easy calling. This
is in preparation for doing it outside of ring freeing as well. With
that in mind, also ensure that we do the proper locking for safe calling
from a context where the ring it still live.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: allow task match to be passed to io_req_cache_free()
Jens Axboe [Sat, 13 Feb 2021 16:00:02 +0000 (09:00 -0700)]
io_uring: allow task match to be passed to io_req_cache_free()

No changes in this patch, just allows a caller to pass in a targeted
task that we must match for freeing requests in the cache.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio-wq: clear out worker ->fs and ->files
Jens Axboe [Fri, 12 Feb 2021 21:02:54 +0000 (14:02 -0700)]
io-wq: clear out worker ->fs and ->files

By default, kernel threads have init_fs and init_files assigned. In the
past, this has triggered security problems, as commands that don't ask
for (and hence don't get assigned) fs/files from the originating task
can then attempt path resolution etc with access to parts of the system
they should not be able to.

Rather than add checks in the fs code for misuse, just set these to
NULL. If we do attempt to use them, then the resulting code will oops
rather than provide access to something that it should not permit.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: optimise io_init_req() flags setting
Pavel Begunkov [Fri, 12 Feb 2021 18:41:17 +0000 (18:41 +0000)]
io_uring: optimise io_init_req() flags setting

Invalid req->flags are tolerated by free/put well, avoid this dancing
needlessly presetting it to zero, and then not even resetting but
modifying it, i.e. "|=".

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: clean io_req_find_next() fast check
Pavel Begunkov [Fri, 12 Feb 2021 18:41:16 +0000 (18:41 +0000)]
io_uring: clean io_req_find_next() fast check

Indirectly io_req_find_next() is called for every request, optimise the
check by testing flags as it was long before -- __io_req_find_next()
tolerates false-positives well (i.e. link==NULL), and those should be
really rare.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: don't check PF_EXITING from syscall
Pavel Begunkov [Fri, 12 Feb 2021 18:41:15 +0000 (18:41 +0000)]
io_uring: don't check PF_EXITING from syscall

io_sq_thread_acquire_mm_files() can find a PF_EXITING task only when
it's called from task_work context. Don't check it in all other cases,
that are when we're in io_uring_enter().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: don't split out consume out of SQE get
Pavel Begunkov [Fri, 12 Feb 2021 11:55:17 +0000 (11:55 +0000)]
io_uring: don't split out consume out of SQE get

Remove io_consume_sqe() and inline it back into io_get_sqe(). It
requires req dealloc on error, but in exchange we get cleaner
io_submit_sqes() and better locality for cached_sq_head.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: save ctx put/get for task_work submit
Pavel Begunkov [Fri, 12 Feb 2021 03:23:54 +0000 (03:23 +0000)]
io_uring: save ctx put/get for task_work submit

Do a little trick in io_ring_ctx_free() briefly taking uring_lock, that
will wait for everyone currently holding it, so we can skip pinning ctx
with ctx->refs for __io_req_task_submit(), which is executed and loses
its refs/reqs while holding the lock.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: don't duplicate io_req_task_queue()
Pavel Begunkov [Fri, 12 Feb 2021 03:23:53 +0000 (03:23 +0000)]
io_uring: don't duplicate io_req_task_queue()

Don't hand code io_req_task_queue() inside of io_async_buf_func(), just
call it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: optimise SQPOLL mm/files grabbing
Pavel Begunkov [Fri, 12 Feb 2021 03:23:52 +0000 (03:23 +0000)]
io_uring: optimise SQPOLL mm/files grabbing

There are two reasons for this. First is to optimise
io_sq_thread_acquire_mm_files() for non-SQPOLL case, which currently do
too many checks and function calls in the hot path, e.g. in
io_init_req().

The second is to not grab mm/files when there are not needed. As
__io_queue_sqe() issues only one request now, we can reuse
io_sq_thread_acquire_mm_files() instead of unconditional acquire
mm/files.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: optimise out unlikely link queue
Pavel Begunkov [Fri, 12 Feb 2021 03:23:51 +0000 (03:23 +0000)]
io_uring: optimise out unlikely link queue

__io_queue_sqe() tries to issue as much requests of a link as it can,
and uses io_put_req_find_next() to extract a next one, targeting inline
completed requests. As now __io_queue_sqe() is always used together with
struct io_comp_state, it leaves next propagation only a small window and
only for async reqs, that doesn't justify its existence.

Remove it, make __io_queue_sqe() to issue only a head request. It
simplifies the code and will allow other optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: take compl state from submit state
Pavel Begunkov [Fri, 12 Feb 2021 03:23:50 +0000 (03:23 +0000)]
io_uring: take compl state from submit state

Completion and submission states are now coupled together, it's weird to
get one from argument and another from ctx, do it consistently for
io_req_free_batch(). It's also faster as we already have @state cached
in registers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: inline io_complete_rw_common()
Pavel Begunkov [Thu, 11 Feb 2021 18:28:23 +0000 (18:28 +0000)]
io_uring: inline io_complete_rw_common()

__io_complete_rw() casts request to kiocb for it to be immediately
container_of()'ed by io_complete_rw_common(). And the last function's name
doesn't do a great job of illuminating its purposes, so just inline it in
its only user.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: move res check out of io_rw_reissue()
Pavel Begunkov [Thu, 11 Feb 2021 18:28:22 +0000 (18:28 +0000)]
io_uring: move res check out of io_rw_reissue()

We pass return code into io_rw_reissue() only to be able to check if it's
-EAGAIN. That's not the cleanest approach and may prevent inlining of the
non-EAGAIN fast path, so do it at call sites.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: simplify iopoll reissuing
Pavel Begunkov [Thu, 11 Feb 2021 18:28:21 +0000 (18:28 +0000)]
io_uring: simplify iopoll reissuing

Don't stash -EAGAIN'ed iopoll requests into a list to reissue it later,
do it eagerly. It removes overhead on keeping and checking that list,
and allows in case of failure for these requests to be completed through
normal iopoll completion path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: clean up io_req_free_batch_finish()
Pavel Begunkov [Thu, 11 Feb 2021 18:28:20 +0000 (18:28 +0000)]
io_uring: clean up io_req_free_batch_finish()

io_req_free_batch_finish() is final and does not permit struct req_batch
to be reused without re-init. To be more consistent don't clear ->task
there.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: move submit side state closer in the ring
Jens Axboe [Thu, 11 Feb 2021 17:48:03 +0000 (10:48 -0700)]
io_uring: move submit side state closer in the ring

We recently added the submit side req cache, but it was placed at the
end of the struct. Move it near the other submission state for better
memory placement, and reshuffle a few other members at the same time.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: assign file_slot prior to calling io_sqe_file_register()
Jens Axboe [Thu, 11 Feb 2021 14:45:08 +0000 (07:45 -0700)]
io_uring: assign file_slot prior to calling io_sqe_file_register()

We use the assigned slot in io_sqe_file_register(), and a previous
patch moved the assignment to after we have called it. This isn't
super pretty, and will get cleaned up in the future. For now, fix
the regression by restoring the previous assignment/clear of the
file_slot.

Fixes: ea64ec02b31d ("io_uring: deduplicate file table slot calculation")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: remove redundant initialization of variable ret
Colin Ian King [Wed, 10 Feb 2021 20:00:07 +0000 (20:00 +0000)]
io_uring: remove redundant initialization of variable ret

The variable ret is being initialized with a value that is never read
and it is being updated later with a new value.  The initialization is
redundant and can be removed.

Addresses-Coverity: ("Unused value")
Fixes: b63534c41e20 ("io_uring: re-issue block requests that failed because of resources")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: unpark SQPOLL thread for cancelation
Pavel Begunkov [Wed, 10 Feb 2021 11:45:42 +0000 (11:45 +0000)]
io_uring: unpark SQPOLL thread for cancelation

We park SQPOLL task before going into io_uring_cancel_files(), so the
task won't run task_works including those that might be important for
the cancellation passes. In this case it's io_poll_remove_one(), which
frees requests via io_put_req_deferred().

Unpark it for while waiting, it's ok as we disable submissions
beforehand, so no new requests will be generated.

INFO: task syz-executor893:8493 blocked for more than 143 seconds.
Call Trace:
 context_switch kernel/sched/core.c:4327 [inline]
 __schedule+0x90c/0x21a0 kernel/sched/core.c:5078
 schedule+0xcf/0x270 kernel/sched/core.c:5157
 io_uring_cancel_files fs/io_uring.c:8912 [inline]
 io_uring_cancel_task_requests+0xe70/0x11a0 fs/io_uring.c:8979
 __io_uring_files_cancel+0x110/0x1b0 fs/io_uring.c:9067
 io_uring_files_cancel include/linux/io_uring.h:51 [inline]
 do_exit+0x2fe/0x2ae0 kernel/exit.c:780
 do_group_exit+0x125/0x310 kernel/exit.c:922
 __do_sys_exit_group kernel/exit.c:933 [inline]
 __se_sys_exit_group kernel/exit.c:931 [inline]
 __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:931
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Cc: stable@vger.kernel.org # 5.5+
Reported-by: syzbot+695b03d82fa8e4901b06@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: place ring SQ/CQ arrays under memcg memory limits
Jens Axboe [Wed, 10 Feb 2021 03:14:12 +0000 (20:14 -0700)]
io_uring: place ring SQ/CQ arrays under memcg memory limits

Instead of imposing rlimit memlock limits for the rings themselves,
ensure that we account them properly under memcg with __GFP_ACCOUNT.
We retain rlimit memlock for registered buffers, this is just for the
ring arrays themselves.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: enable kmemcg account for io_uring requests
Jens Axboe [Tue, 9 Feb 2021 20:48:50 +0000 (13:48 -0700)]
io_uring: enable kmemcg account for io_uring requests

This puts io_uring under the memory cgroups accounting and limits for
requests.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: enable req cache for IRQ driven IO
Jens Axboe [Wed, 10 Feb 2021 02:53:37 +0000 (19:53 -0700)]
io_uring: enable req cache for IRQ driven IO

This is the last class of requests that cannot utilize the req alloc
cache. Add a per-ctx req cache that is protected by the completion_lock,
and refill our submit side cache when it gets over our batch count.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: fix possible deadlock in io_uring_poll
Hao Xu [Fri, 5 Feb 2021 08:34:21 +0000 (16:34 +0800)]
io_uring: fix possible deadlock in io_uring_poll

Abaci reported follow issue:

[   30.615891] ======================================================
[   30.616648] WARNING: possible circular locking dependency detected
[   30.617423] 5.11.0-rc3-next-20210115 #1 Not tainted
[   30.618035] ------------------------------------------------------
[   30.618914] a.out/1128 is trying to acquire lock:
[   30.619520] ffff88810b063868 (&ep->mtx){+.+.}-{3:3}, at: __ep_eventpoll_poll+0x9f/0x220
[   30.620505]
[   30.620505] but task is already holding lock:
[   30.621218] ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
[   30.622349]
[   30.622349] which lock already depends on the new lock.
[   30.622349]
[   30.623289]
[   30.623289] the existing dependency chain (in reverse order) is:
[   30.624243]
[   30.624243] -> #1 (&ctx->uring_lock){+.+.}-{3:3}:
[   30.625263]        lock_acquire+0x2c7/0x390
[   30.625868]        __mutex_lock+0xae/0x9f0
[   30.626451]        io_cqring_overflow_flush.part.95+0x6d/0x70
[   30.627278]        io_uring_poll+0xcb/0xd0
[   30.627890]        ep_item_poll.isra.14+0x4e/0x90
[   30.628531]        do_epoll_ctl+0xb7e/0x1120
[   30.629122]        __x64_sys_epoll_ctl+0x70/0xb0
[   30.629770]        do_syscall_64+0x2d/0x40
[   30.630332]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   30.631187]
[   30.631187] -> #0 (&ep->mtx){+.+.}-{3:3}:
[   30.631985]        check_prevs_add+0x226/0xb00
[   30.632584]        __lock_acquire+0x1237/0x13a0
[   30.633207]        lock_acquire+0x2c7/0x390
[   30.633740]        __mutex_lock+0xae/0x9f0
[   30.634258]        __ep_eventpoll_poll+0x9f/0x220
[   30.634879]        __io_arm_poll_handler+0xbf/0x220
[   30.635462]        io_issue_sqe+0xa6b/0x13e0
[   30.635982]        __io_queue_sqe+0x10b/0x550
[   30.636648]        io_queue_sqe+0x235/0x470
[   30.637281]        io_submit_sqes+0xcce/0xf10
[   30.637839]        __x64_sys_io_uring_enter+0x3fb/0x5b0
[   30.638465]        do_syscall_64+0x2d/0x40
[   30.638999]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   30.639643]
[   30.639643] other info that might help us debug this:
[   30.639643]
[   30.640618]  Possible unsafe locking scenario:
[   30.640618]
[   30.641402]        CPU0                    CPU1
[   30.641938]        ----                    ----
[   30.642664]   lock(&ctx->uring_lock);
[   30.643425]                                lock(&ep->mtx);
[   30.644498]                                lock(&ctx->uring_lock);
[   30.645668]   lock(&ep->mtx);
[   30.646321]
[   30.646321]  *** DEADLOCK ***
[   30.646321]
[   30.647642] 1 lock held by a.out/1128:
[   30.648424]  #0: ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
[   30.649954]
[   30.649954] stack backtrace:
[   30.650592] CPU: 1 PID: 1128 Comm: a.out Not tainted 5.11.0-rc3-next-20210115 #1
[   30.651554] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[   30.652290] Call Trace:
[   30.652688]  dump_stack+0xac/0xe3
[   30.653164]  check_noncircular+0x11e/0x130
[   30.653747]  ? check_prevs_add+0x226/0xb00
[   30.654303]  check_prevs_add+0x226/0xb00
[   30.654845]  ? add_lock_to_list.constprop.49+0xac/0x1d0
[   30.655564]  __lock_acquire+0x1237/0x13a0
[   30.656262]  lock_acquire+0x2c7/0x390
[   30.656788]  ? __ep_eventpoll_poll+0x9f/0x220
[   30.657379]  ? __io_queue_proc.isra.88+0x180/0x180
[   30.658014]  __mutex_lock+0xae/0x9f0
[   30.658524]  ? __ep_eventpoll_poll+0x9f/0x220
[   30.659112]  ? mark_held_locks+0x5a/0x80
[   30.659648]  ? __ep_eventpoll_poll+0x9f/0x220
[   30.660229]  ? _raw_spin_unlock_irqrestore+0x2d/0x40
[   30.660885]  ? trace_hardirqs_on+0x46/0x110
[   30.661471]  ? __io_queue_proc.isra.88+0x180/0x180
[   30.662102]  ? __ep_eventpoll_poll+0x9f/0x220
[   30.662696]  __ep_eventpoll_poll+0x9f/0x220
[   30.663273]  ? __ep_eventpoll_poll+0x220/0x220
[   30.663875]  __io_arm_poll_handler+0xbf/0x220
[   30.664463]  io_issue_sqe+0xa6b/0x13e0
[   30.664984]  ? __lock_acquire+0x782/0x13a0
[   30.665544]  ? __io_queue_proc.isra.88+0x180/0x180
[   30.666170]  ? __io_queue_sqe+0x10b/0x550
[   30.666725]  __io_queue_sqe+0x10b/0x550
[   30.667252]  ? __fget_files+0x131/0x260
[   30.667791]  ? io_req_prep+0xd8/0x1090
[   30.668316]  ? io_queue_sqe+0x235/0x470
[   30.668868]  io_queue_sqe+0x235/0x470
[   30.669398]  io_submit_sqes+0xcce/0xf10
[   30.669931]  ? xa_load+0xe4/0x1c0
[   30.670425]  __x64_sys_io_uring_enter+0x3fb/0x5b0
[   30.671051]  ? lockdep_hardirqs_on_prepare+0xde/0x180
[   30.671719]  ? syscall_enter_from_user_mode+0x2b/0x80
[   30.672380]  do_syscall_64+0x2d/0x40
[   30.672901]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   30.673503] RIP: 0033:0x7fd89c813239
[   30.673962] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05  3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48
[   30.675920] RSP: 002b:00007ffc65a7c628 EFLAGS: 00000217 ORIG_RAX: 00000000000001aa
[   30.676791] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd89c813239
[   30.677594] RDX: 0000000000000000 RSI: 0000000000000014 RDI: 0000000000000003
[   30.678678] RBP: 00007ffc65a7c720 R08: 0000000000000000 R09: 0000000003000000
[   30.679492] R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000400ff0
[   30.680282] R13: 00007ffc65a7c840 R14: 0000000000000000 R15: 0000000000000000

This might happen if we do epoll_wait on a uring fd while reading/writing
the former epoll fd in a sqe in the former uring instance.
So let's don't flush cqring overflow list, just do a simple check.

Reported-by: Abaci <abaci@linux.alibaba.com>
Fixes: 6c503150ae33 ("io_uring: patch up IOPOLL overflow_flush sync")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: defer flushing cached reqs
Pavel Begunkov [Wed, 10 Feb 2021 00:03:23 +0000 (00:03 +0000)]
io_uring: defer flushing cached reqs

Awhile there are requests in the allocation cache -- use them, only if
those ended go for the stashed memory in comp.free_list. As list
manipulation are generally heavy and are not good for caches, flush them
all or as much as can in one go.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: return success/failure from io_flush_cached_reqs()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: take comp_state from ctx
Pavel Begunkov [Wed, 10 Feb 2021 00:03:22 +0000 (00:03 +0000)]
io_uring: take comp_state from ctx

__io_queue_sqe() is always called with a non-NULL comp_state, which is
taken directly from context. Don't pass it around but infer from ctx.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: enable req cache for task_work items
Jens Axboe [Wed, 10 Feb 2021 00:03:21 +0000 (00:03 +0000)]
io_uring: enable req cache for task_work items

task_work is run without utilizing the req alloc cache, so any deferred
items don't get to take advantage of either the alloc or free side of it.
With task_work now being wrapped by io_uring, we can use the ctx
completion state to both use the req cache and the completion flush
batching.

With this, the only request type that cannot take advantage of the req
cache is IRQ driven IO for regular files / block devices. Anything else,
including IOPOLL polled IO to those same tyes, will take advantage of it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: provide FIFO ordering for task_work
Jens Axboe [Wed, 10 Feb 2021 00:03:20 +0000 (00:03 +0000)]
io_uring: provide FIFO ordering for task_work

task_work is a LIFO list, due to how it's implemented as a lockless
list. For long chains of task_work, this can be problematic as the
first entry added is the last one processed. Similarly, we'd waste
a lot of CPU cycles reversing this list.

Wrap the task_work so we have a single task_work entry per task per
ctx, and use that to run it in the right order.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: use persistent request cache
Jens Axboe [Wed, 10 Feb 2021 00:03:19 +0000 (00:03 +0000)]
io_uring: use persistent request cache

Now that we have the submit_state in the ring itself, we can have io_kiocb
allocations that are persistent across invocations. This reduces the time
spent doing slab allocations and frees.

[sil: rebased]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: feed reqs back into alloc cache
Pavel Begunkov [Wed, 10 Feb 2021 00:03:18 +0000 (00:03 +0000)]
io_uring: feed reqs back into alloc cache

Make io_req_free_batch(), which is used for inline executed requests and
IOPOLL, to return requests back into the allocation cache, so avoid
most of kmalloc()/kfree() for those cases.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: persistent req cache
Pavel Begunkov [Wed, 10 Feb 2021 00:03:17 +0000 (00:03 +0000)]
io_uring: persistent req cache

Don't free batch-allocated requests across syscalls.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: count ctx refs separately from reqs
Pavel Begunkov [Wed, 10 Feb 2021 00:03:16 +0000 (00:03 +0000)]
io_uring: count ctx refs separately from reqs

Currently batch free handles request memory freeing and ctx ref putting
together. Separate them and use different counters, that will be needed
for reusing reqs memory.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: remove fallback_req
Pavel Begunkov [Wed, 10 Feb 2021 00:03:15 +0000 (00:03 +0000)]
io_uring: remove fallback_req

Remove fallback_req for now, it gets in the way of other changes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: submit-completion free batching
Pavel Begunkov [Wed, 10 Feb 2021 00:03:14 +0000 (00:03 +0000)]
io_uring: submit-completion free batching

io_submit_flush_completions() does completion batching, but may also use
free batching as iopoll does. The main beneficiaries should be buffered
reads/writes and send/recv.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: replace list with array for compl batch
Pavel Begunkov [Wed, 10 Feb 2021 00:03:13 +0000 (00:03 +0000)]
io_uring: replace list with array for compl batch

Reincarnation of an old patch that replaces a list in struct
io_compl_batch with an array. It's needed to avoid hooking requests via
their compl.list, because it won't be always available in the future.

It's also nice to split io_submit_flush_completions() to avoid free
under locks and remove unlock/lock with a long comment describing when
it can be done.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: don't reinit submit state every time
Pavel Begunkov [Wed, 10 Feb 2021 00:03:12 +0000 (00:03 +0000)]
io_uring: don't reinit submit state every time

As now submit_state is retained across syscalls, we can save ourself
from initialising it from ground up for each io_submit_sqes(). Set some
fields during ctx allocation, and just keep them always consistent.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: remove unnecessary zeroing of ctx members]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: remove ctx from comp_state
Pavel Begunkov [Wed, 10 Feb 2021 00:03:11 +0000 (00:03 +0000)]
io_uring: remove ctx from comp_state

completion state is closely bound to ctx, we don't need to store ctx
inside as we always have it around to pass to flush.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: don't keep submit_state on stack
Pavel Begunkov [Wed, 10 Feb 2021 00:03:10 +0000 (00:03 +0000)]
io_uring: don't keep submit_state on stack

struct io_submit_state is quite big (168 bytes) and going to grow. It's
better to not keep it on stack as it is now. Move it to context, it's
always protected by uring_lock, so it's fine to have only one instance
of it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: don't propagate io_comp_state
Pavel Begunkov [Wed, 10 Feb 2021 00:03:09 +0000 (00:03 +0000)]
io_uring: don't propagate io_comp_state

There is no reason to drag io_comp_state into opcode handlers, we just
need a flag and the actual work will be done in __io_queue_sqe().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: make op handlers always take issue flags
Pavel Begunkov [Wed, 10 Feb 2021 00:03:08 +0000 (00:03 +0000)]
io_uring: make op handlers always take issue flags

Make opcode handler interfaces a bit more consistent by always passing
in issue flags. Bulky but pretty easy and mechanical change.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: replace force_nonblock with flags
Pavel Begunkov [Wed, 10 Feb 2021 00:03:07 +0000 (00:03 +0000)]
io_uring: replace force_nonblock with flags

Replace bool force_nonblock with flags. It has a long standing goal of
differentiating context from which we execute. Currently we have some
subtle places where some invariants, like holding of uring_lock, are
subtly inferred.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: cleanup up cancel SQPOLL reqs across exec
Pavel Begunkov [Sun, 7 Feb 2021 22:34:26 +0000 (22:34 +0000)]
io_uring: cleanup up cancel SQPOLL reqs across exec

For SQPOLL rings tctx_inflight() always returns zero, so it might skip
doing full cancelation. It's fine because we jam all sqpoll submissions
in any case and do go through files cancel for them, but not nice.

Do the intended full cancellation, by mimicking __io_uring_task_cancel()
waiting but impersonating SQPOLL task.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: refactor sendmsg/recvmsg iov managing
Pavel Begunkov [Fri, 5 Feb 2021 00:58:00 +0000 (00:58 +0000)]
io_uring: refactor sendmsg/recvmsg iov managing

Current iov handling with recvmsg/sendmsg may be confusing. First make a
rule for msg->iov: either it points to an allocated iov that have to be
kfree()'d later, or it's NULL and we use fast_iov. That's much better
than current 3-state (also can point to fast_iov). And rename it into
free_iov for uniformity with read/write.

Also, instead of after struct io_async_msghdr copy fixing up of
msg.msg_iter.iov has been happening in io_recvmsg()/io_sendmsg(). Move
it into io_setup_async_msg(), that's the right place.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: add comment on NULL check before kfree()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: clean iov usage for recvmsg buf select
Pavel Begunkov [Fri, 5 Feb 2021 00:57:59 +0000 (00:57 +0000)]
io_uring: clean iov usage for recvmsg buf select

Don't pretend we don't know that REQ_F_BUFFER_SELECT for recvmsg always
uses fast_iov -- clean up confusing intermixing kmsg->iov and
kmsg->fast_iov for buffer select.

Also don't init iter with garbage in __io_recvmsg_copy_hdr() only for it
to be set shortly after in io_recvmsg().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: set msg_name on msg fixup
Pavel Begunkov [Fri, 5 Feb 2021 00:57:58 +0000 (00:57 +0000)]
io_uring: set msg_name on msg fixup

io_setup_async_msg() should fully prepare io_async_msghdr, let it also
handle assigning msg_name and don't hand code it in [send,recv]msg().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring/io-wq: return 2-step work swap scheme
Pavel Begunkov [Thu, 4 Feb 2021 13:52:08 +0000 (13:52 +0000)]
io_uring/io-wq: return 2-step work swap scheme

Saving one lock/unlock for io-wq is not super important, but adds some
ugliness in the code. More important, atomic decs not turning it to zero
for some archs won't give the right ordering/barriers so the
io_steal_work() may pretty easily get subtly and completely broken.

Return back 2-step io-wq work exchange and clean it up.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: deduplicate file table slot calculation
Pavel Begunkov [Thu, 4 Feb 2021 13:52:07 +0000 (13:52 +0000)]
io_uring: deduplicate file table slot calculation

Extract a helper io_fixed_file_slot() returning a place in our fixed
files table, so we don't hand-code it three times in the code.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: io_import_iovec return type cleanup
Pavel Begunkov [Thu, 4 Feb 2021 13:52:06 +0000 (13:52 +0000)]
io_uring: io_import_iovec return type cleanup

io_import_iovec() doesn't return IO size anymore, only error code. Make
it more apparent by returning int instead of ssize and clean up
leftovers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: treat NONBLOCK and RWF_NOWAIT similarly
Pavel Begunkov [Thu, 4 Feb 2021 13:52:05 +0000 (13:52 +0000)]
io_uring: treat NONBLOCK and RWF_NOWAIT similarly

Make decision making of whether we need to retry read/write similar for
O_NONBLOCK and RWF_NOWAIT. Set REQ_F_NOWAIT when either is specified and
use it for all relevant checks. Also fix resubmitting NOWAIT requests
via io_rw_reissue().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: highlight read-retry loop
Pavel Begunkov [Thu, 4 Feb 2021 13:52:04 +0000 (13:52 +0000)]
io_uring: highlight read-retry loop

We already have implicit do-while for read-retries but with goto in the
end. Convert it to an actual do-while, it highlights it so making a
bit more understandable and is cleaner in general.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: inline io_read()'s iovec freeing
Pavel Begunkov [Thu, 4 Feb 2021 13:52:03 +0000 (13:52 +0000)]
io_uring: inline io_read()'s iovec freeing

io_read() has not the simpliest control flow with a lot of jumps and
it's hard to read. One of those is a out_free: label, which frees iovec.
However, from the middle of io_read() iovec is NULL'ed and so
kfree(iovec) is no-op, it leaves us with two place where we can inline
it and further clean up the code.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: don't forget to adjust io_size
Pavel Begunkov [Thu, 4 Feb 2021 13:52:02 +0000 (13:52 +0000)]
io_uring: don't forget to adjust io_size

We have invariant in io_read() of how much we're trying to read spilled
into an iter and io_size variable. The last one controls decision making
about whether to do read-retries. However, io_size is modified only
after the first read attempt, so if we happen to go for a third retry in
a single call to io_read(), we will get io_size greater than in the
iterator, so may lead to various side effects up to live-locking.

Modify io_size each time.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: let io_setup_async_rw take care of iovec
Pavel Begunkov [Thu, 4 Feb 2021 13:52:01 +0000 (13:52 +0000)]
io_uring: let io_setup_async_rw take care of iovec

Now we give out ownership of iovec into io_setup_async_rw(), so it
either sets request's context right or frees the iovec on error itself.
Makes our life a bit easier at call sites.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: further simplify do_read error parsing
Pavel Begunkov [Thu, 4 Feb 2021 13:52:00 +0000 (13:52 +0000)]
io_uring: further simplify do_read error parsing

First, instead of checking iov_iter_count(iter) for 0 to find out that
all needed bytes were read, just compare returned code against io_size.
It's more reliable and arguably cleaner.

Also, place the half-read case into an else branch and delete an extra
label.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: refactor io_read for unsupported nowait
Pavel Begunkov [Thu, 4 Feb 2021 13:51:59 +0000 (13:51 +0000)]
io_uring: refactor io_read for unsupported nowait

!io_file_supports_async() case of io_read() is hard to read, it jumps
somewhere in the middle of the function just to do async setup and fail
on a similar check. Call io_setup_async_rw() directly for this case,
it's much easier to follow.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: refactor io_cqring_wait
Pavel Begunkov [Thu, 4 Feb 2021 13:51:58 +0000 (13:51 +0000)]
io_uring: refactor io_cqring_wait

It's easy to make a mistake in io_cqring_wait() because for all
break/continue clauses we need to watch for prepare/finish_wait to be
used correctly. Extract all those into a new helper
io_cqring_wait_schedule(), and transforming the loop into simple series
of func calls: prepare(); check_and_schedule(); finish();

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: refactor scheduling in io_cqring_wait
Pavel Begunkov [Thu, 4 Feb 2021 13:51:57 +0000 (13:51 +0000)]
io_uring: refactor scheduling in io_cqring_wait

schedule_timeout() with timeout=MAX_SCHEDULE_TIMEOUT is guaranteed to
work just as schedule(), so instead of hand-coding it based on arguments
always use the timeout version and simplify code.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: deduplicate core cancellations sequence
Pavel Begunkov [Thu, 4 Feb 2021 13:51:56 +0000 (13:51 +0000)]
io_uring: deduplicate core cancellations sequence

Files and task cancellations go over same steps trying to cancel
requests in io-wq, poll, etc. Deduplicate it with a helper.

note: new io_uring_try_cancel_requests() is former
__io_uring_cancel_task_requests() with files passed as an agrument and
flushing overflowed requests.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: simplify do_read return parsing
Pavel Begunkov [Mon, 1 Feb 2021 18:59:56 +0000 (18:59 +0000)]
io_uring: simplify do_read return parsing

do_read() returning 0 bytes read (not -EAGAIN/etc.) is not an important
enough of a case to prioritise it. Fold it into ret < 0 check, so we get
rid of an extra if and make it a bit more readable.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: deduplicate adding to REQ_F_INFLIGHT
Pavel Begunkov [Mon, 1 Feb 2021 18:59:55 +0000 (18:59 +0000)]
io_uring: deduplicate adding to REQ_F_INFLIGHT

We don't know for how long REQ_F_INFLIGHT is going to stay, cleaner to
extract a helper for marking requests as so.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: remove work flags after cleanup
Pavel Begunkov [Mon, 1 Feb 2021 18:59:54 +0000 (18:59 +0000)]
io_uring: remove work flags after cleanup

Shouldn't be a problem now, but it's better to clean
REQ_F_WORK_INITIALIZED and work->flags only after relevant resources are
killed, so cancellation see them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: inline io_req_drop_files()
Pavel Begunkov [Mon, 1 Feb 2021 18:59:53 +0000 (18:59 +0000)]
io_uring: inline io_req_drop_files()

req->files now have same lifetime as all other iowq-work resources,
inline io_req_drop_files() for consistency. Moreover, since
REQ_F_INFLIGHT is no more files specific, the function name became
very confusing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: kill not used needs_file_no_error
Pavel Begunkov [Mon, 1 Feb 2021 18:59:52 +0000 (18:59 +0000)]
io_uring: kill not used needs_file_no_error

We have no request types left using needs_file_no_error, remove it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: fix inconsistent lock state
Pavel Begunkov [Mon, 1 Feb 2021 18:59:51 +0000 (18:59 +0000)]
io_uring: fix inconsistent lock state

WARNING: inconsistent lock state

inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
syz-executor217/8450 [HC1[1]:SC0[0]:HE0:SE1] takes:
ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline]
ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_req_clean_work fs/io_uring.c:1398 [inline]
ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&fs->lock);
  <Interrupt>
    lock(&fs->lock);

 *** DEADLOCK ***

1 lock held by syz-executor217/8450:
 #0: ffff88802417c3e8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0x1071/0x1f30 fs/io_uring.c:9442

stack backtrace:
CPU: 1 PID: 8450 Comm: syz-executor217 Not tainted 5.11.0-rc5-next-20210129-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
[...]
 _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
 spin_lock include/linux/spinlock.h:354 [inline]
 io_req_clean_work fs/io_uring.c:1398 [inline]
 io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029
 __io_free_req+0x3d/0x2e0 fs/io_uring.c:2046
 io_free_req fs/io_uring.c:2269 [inline]
 io_double_put_req fs/io_uring.c:2392 [inline]
 io_put_req+0xf9/0x570 fs/io_uring.c:2388
 io_link_timeout_fn+0x30c/0x480 fs/io_uring.c:6497
 __run_hrtimer kernel/time/hrtimer.c:1519 [inline]
 __hrtimer_run_queues+0x609/0xe40 kernel/time/hrtimer.c:1583
 hrtimer_interrupt+0x334/0x940 kernel/time/hrtimer.c:1645
 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1085 [inline]
 __sysvec_apic_timer_interrupt+0x146/0x540 arch/x86/kernel/apic/apic.c:1102
 asm_call_irq_on_stack+0xf/0x20
 </IRQ>
 __run_sysvec_on_irqstack arch/x86/include/asm/irq_stack.h:37 [inline]
 run_sysvec_on_irqstack_cond arch/x86/include/asm/irq_stack.h:89 [inline]
 sysvec_apic_timer_interrupt+0xbd/0x100 arch/x86/kernel/apic/apic.c:1096
 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:629
RIP: 0010:__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:169 [inline]
RIP: 0010:_raw_spin_unlock_irq+0x25/0x40 kernel/locking/spinlock.c:199
 spin_unlock_irq include/linux/spinlock.h:404 [inline]
 io_queue_linked_timeout+0x194/0x1f0 fs/io_uring.c:6525
 __io_queue_sqe+0x328/0x1290 fs/io_uring.c:6594
 io_queue_sqe+0x631/0x10d0 fs/io_uring.c:6639
 io_queue_link_head fs/io_uring.c:6650 [inline]
 io_submit_sqe fs/io_uring.c:6697 [inline]
 io_submit_sqes+0x19b5/0x2720 fs/io_uring.c:6960
 __do_sys_io_uring_enter+0x107d/0x1f30 fs/io_uring.c:9443
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Don't free requests from under hrtimer context (softirq) as it may sleep
or take spinlocks improperly (e.g. non-irq versions).

Cc: stable@vger.kernel.org # 5.6+
Reported-by: syzbot+81d17233a2b02eafba33@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: Fix NULL dereference in error in io_sqe_files_register()
Dan Carpenter [Mon, 1 Feb 2021 12:23:42 +0000 (15:23 +0300)]
io_uring: Fix NULL dereference in error in io_sqe_files_register()

If we hit a "goto out_free;" before the "ctx->file_data" pointer has
been assigned then it leads to a NULL derefence when we call:

free_fixed_rsrc_data(ctx->file_data);

We can fix this by moving the assignment earlier.

Fixes: 1ad555c6ae6e ("io_uring: create common fixed_rsrc_data allocation routines")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: check kthread parked flag before sqthread goes to sleep
Hao Xu [Sun, 31 Jan 2021 14:39:04 +0000 (22:39 +0800)]
io_uring: check kthread parked flag before sqthread goes to sleep

Abaci reported this issue:

#[  605.170872] INFO: task kworker/u4:1:53 blocked for more than 143 seconds.
[  605.172123]       Not tainted 5.10.0+ #1
[  605.172811] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  605.173915] task:kworker/u4:1    state:D stack:    0 pid:   53 ppid:     2 flags:0x00004000
[  605.175130] Workqueue: events_unbound io_ring_exit_work
[  605.175931] Call Trace:
[  605.176334]  __schedule+0xe0e/0x25a0
[  605.176971]  ? firmware_map_remove+0x1a1/0x1a1
[  605.177631]  ? write_comp_data+0x2a/0x80
[  605.178272]  schedule+0xd0/0x270
[  605.178811]  schedule_timeout+0x6b6/0x940
[  605.179415]  ? mark_lock.part.0+0xca/0x1420
[  605.180062]  ? usleep_range+0x170/0x170
[  605.180684]  ? wait_for_completion+0x16d/0x280
[  605.181392]  ? mark_held_locks+0x9e/0xe0
[  605.182079]  ? rwlock_bug.part.0+0x90/0x90
[  605.182853]  ? lockdep_hardirqs_on_prepare+0x286/0x400
[  605.183817]  wait_for_completion+0x175/0x280
[  605.184713]  ? wait_for_completion_interruptible+0x340/0x340
[  605.185611]  ? _raw_spin_unlock_irq+0x24/0x30
[  605.186307]  ? migrate_swap_stop+0x9c0/0x9c0
[  605.187046]  kthread_park+0x127/0x1c0
[  605.187738]  io_sq_thread_stop+0xd5/0x530
[  605.188459]  io_ring_exit_work+0xb1/0x970
[  605.189207]  process_one_work+0x92c/0x1510
[  605.189947]  ? pwq_dec_nr_in_flight+0x360/0x360
[  605.190682]  ? rwlock_bug.part.0+0x90/0x90
[  605.191430]  ? write_comp_data+0x2a/0x80
[  605.192207]  worker_thread+0x9b/0xe20
[  605.192900]  ? process_one_work+0x1510/0x1510
[  605.193599]  kthread+0x353/0x460
[  605.194154]  ? _raw_spin_unlock_irq+0x24/0x30
[  605.194910]  ? kthread_create_on_node+0x100/0x100
[  605.195821]  ret_from_fork+0x1f/0x30
[  605.196605]
[  605.196605] Showing all locks held in the system:
[  605.197598] 1 lock held by khungtaskd/25:
[  605.198301]  #0: ffffffff8b5f76a0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.0+0x0/0x30
[  605.199914] 3 locks held by kworker/u4:1/53:
[  605.200609]  #0: ffff888100109938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x82a/0x1510
[  605.202108]  #1: ffff888100e47dc0 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work+0x85e/0x1510
[  605.203681]  #2: ffff888116931870 (&sqd->lock){+.+.}-{3:3}, at: io_sq_thread_park.part.0+0x19/0x50
[  605.205183] 3 locks held by systemd-journal/161:
[  605.206037] 1 lock held by syslog-ng/254:
[  605.206674] 2 locks held by agetty/311:
[  605.207292]  #0: ffff888101097098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x27/0x80
[  605.208715]  #1: ffffc900000332e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x222/0x1bb0
[  605.210131] 2 locks held by bash/677:
[  605.210723]  #0: ffff88810419a098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x27/0x80
[  605.212105]  #1: ffffc900000512e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x222/0x1bb0
[  605.213777]
[  605.214151] =============================================

I believe this is caused by the follow race:

(ctx_list is empty now)
=> io_put_sq_data               |
==> kthread_park(sqd->thread);  |
====> set KTHREAD_SHOULD_PARK |
====> wake_up_process(k)        | sq thread is running
|
|
| needs_sched is true since no ctx,
| so TASK_INTERRUPTIBLE set and schedule
| out then never wake up again
|
====> wait_for_completion |
(stuck here)

So check if sqthread gets park flag right before schedule().
since ctx_list is always empty when this problem happens, here I put
kthread_should_park() before setting the wakeup flag(ctx_list is empty
so this for loop is fast), where is close enough to schedule(). The
problem doesn't show again in my repro testing after this fix.

Reported-by: Abaci <abaci@linux.alibaba.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoMAINTAINERS: update io_uring section
Pavel Begunkov [Wed, 27 Jan 2021 01:25:04 +0000 (01:25 +0000)]
MAINTAINERS: update io_uring section

Add the missing kernel io_uring header, add Pavel as a reviewer, and
exclude io_uring from the FILESYSTEMS section to avoid keep spamming Al
(mainly) with bug reports, patches, etc.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: Add skip option for __io_sqe_files_update
noah [Tue, 26 Jan 2021 20:23:28 +0000 (15:23 -0500)]
io_uring: Add skip option for __io_sqe_files_update

This patch adds support for skipping a file descriptor when using
IORING_REGISTER_FILES_UPDATE.  __io_sqe_files_update will skip fds set
to IORING_REGISTER_FILES_SKIP. IORING_REGISTER_FILES_SKIP is inturn
added as a #define in io_uring.h

Signed-off-by: noah <goldstein.w.n@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: cleanup files_update looping
Pavel Begunkov [Tue, 26 Jan 2021 13:51:09 +0000 (13:51 +0000)]
io_uring: cleanup files_update looping

Replace a while with a simple for loop, that looks way more natural, and
enables us to use "continue" as indexes are no more updated by hand in
the end of the loop.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: consolidate putting reqs task
Pavel Begunkov [Mon, 25 Jan 2021 11:42:21 +0000 (11:42 +0000)]
io_uring: consolidate putting reqs task

We grab a task for each request and while putting it it also have to do
extra work like inflight accounting and waking up that task. This
sequence is duplicated several time, it's good time to add a helper.
More to that, the helper generates better code due to better locality
and so not failing alias analysis.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: ensure only sqo_task has file notes
Pavel Begunkov [Mon, 25 Jan 2021 11:42:20 +0000 (11:42 +0000)]
io_uring: ensure only sqo_task has file notes

For SQPOLL io_uring we want to have only one file note held by
sqo_task. Add a warning to make sure it holds. It's deep in
io_uring_add_task_file() out of hot path, so shouldn't hurt.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: simplify io_remove_personalities()
Yejune Deng [Thu, 24 Dec 2020 03:02:20 +0000 (11:02 +0800)]
io_uring: simplify io_remove_personalities()

The function io_remove_personalities() is very similar to
io_unregister_personality(),so implement io_remove_personalities()
calling io_unregister_personality().

Signed-off-by: Yejune Deng <yejune.deng@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring/io-wq: kill off now unused IO_WQ_WORK_NO_CANCEL
Jens Axboe [Tue, 19 Jan 2021 22:53:54 +0000 (15:53 -0700)]
io_uring/io-wq: kill off now unused IO_WQ_WORK_NO_CANCEL

It's no longer used as IORING_OP_CLOSE got rid for the need of flagging
it as uncancelable, kill it of.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: get rid of intermediate IORING_OP_CLOSE stage
Jens Axboe [Tue, 19 Jan 2021 22:50:37 +0000 (15:50 -0700)]
io_uring: get rid of intermediate IORING_OP_CLOSE stage

We currently split the close into two, in case we have a ->flush op
that we can't safely handle from non-blocking context. This requires
us to flag the op as uncancelable if we do need to punt it async, and
that means special handling for just this op type.

Use __close_fd_get_file() and grab the files lock so we can get the file
and check if we need to go async in one atomic operation. That gets rid
of the need for splitting this into two steps, and hence the need for
IO_WQ_WORK_NO_CANCEL.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agofs: provide locked helper variant of close_fd_get_file()
Jens Axboe [Tue, 19 Jan 2021 22:41:52 +0000 (15:41 -0700)]
fs: provide locked helper variant of close_fd_get_file()

Assumes current->files->file_lock is already held on invocation. Helps
the caller check the file before removing the fd, if it needs to.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: save atomic dec for inline executed reqs
Pavel Begunkov [Tue, 19 Jan 2021 13:32:47 +0000 (13:32 +0000)]
io_uring: save atomic dec for inline executed reqs

When a request is completed with comp_state, its completion reference
put is deferred to io_submit_flush_completions(), but the submission
is put not far from there, so do it together to save one atomic dec per
request. That targets requests that complete inline, e.g. buffered rw,
send/recv.

Proper benchmarking haven't been conducted but for nops(batch=32) it was
around 7901 vs 8117 KIOPS (~2.7%), or ~4% per perf profiling.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: don't flush CQEs deep down the stack
Pavel Begunkov [Tue, 19 Jan 2021 13:32:46 +0000 (13:32 +0000)]
io_uring: don't flush CQEs deep down the stack

io_submit_flush_completions() is called down the stack in the _state
version of io_req_complete(), that's ok because is only called by
io_uring opcode handler functions directly. Move it up to
__io_queue_sqe() as preparation.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: help inlining of io_req_complete()
Pavel Begunkov [Tue, 19 Jan 2021 13:32:45 +0000 (13:32 +0000)]
io_uring: help inlining of io_req_complete()

__io_req_complete() inlining is a bit weird, some compilers don't
optimise out the non-NULL branch of it even when called as
io_req_complete(). Help it a bit by extracting state and stateless
helpers out of __io_req_complete().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: add a helper timeout mode calculation
Pavel Begunkov [Tue, 19 Jan 2021 13:32:44 +0000 (13:32 +0000)]
io_uring: add a helper timeout mode calculation

Deduplicates translation of timeout flags into hrtimer_mode.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: deduplicate failing task_work_add
Pavel Begunkov [Tue, 19 Jan 2021 13:32:42 +0000 (13:32 +0000)]
io_uring: deduplicate failing task_work_add

When io_req_task_work_add() fails, the request will be cancelled by
enqueueing via task_works of io-wq. Extract a function for that.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: remove __io_state_file_put
Pavel Begunkov [Tue, 19 Jan 2021 13:32:41 +0000 (13:32 +0000)]
io_uring: remove __io_state_file_put

The check in io_state_file_put() is optimised pretty well when called
from __io_file_get(). Don't pollute the code with all these variants.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: simplify io_alloc_req()
Pavel Begunkov [Tue, 19 Jan 2021 13:32:40 +0000 (13:32 +0000)]
io_uring: simplify io_alloc_req()

Get rid of a label in io_alloc_req(), it's cleaner to do return
directly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: further deduplicate #CQ events calc
Pavel Begunkov [Tue, 19 Jan 2021 13:32:39 +0000 (13:32 +0000)]
io_uring: further deduplicate #CQ events calc

Apparently, there is one more place hand coded calculation of number of
CQ events in the ring. Use __io_cqring_events() helper in
io_get_cqring() as well. Naturally, assembly stays identical.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: inline __io_commit_cqring()
Pavel Begunkov [Tue, 19 Jan 2021 13:32:38 +0000 (13:32 +0000)]
io_uring: inline __io_commit_cqring()

Inline it in its only user, that's cleaner

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: inline io_async_submit()
Pavel Begunkov [Tue, 19 Jan 2021 13:32:37 +0000 (13:32 +0000)]
io_uring: inline io_async_submit()

The name is confusing and it's used only in one place.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: cleanup personalities under uring_lock
Pavel Begunkov [Tue, 19 Jan 2021 13:32:36 +0000 (13:32 +0000)]
io_uring: cleanup personalities under uring_lock

personality_idr is usually synchronised by uring_lock, the exception
would be removing personalities in io_ring_ctx_wait_and_kill(), which
is legit as refs are killed by that point but still would be more
resilient to do it under the lock.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: refactor io_resubmit_prep()
Pavel Begunkov [Tue, 19 Jan 2021 13:32:35 +0000 (13:32 +0000)]
io_uring: refactor io_resubmit_prep()

It's awkward to pass return a value into a function for it to return it
back. Check it at the caller site and clean up io_resubmit_prep() a bit.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: optimise io_rw_reissue()
Pavel Begunkov [Tue, 19 Jan 2021 13:32:34 +0000 (13:32 +0000)]
io_uring: optimise io_rw_reissue()

The hot path is IO completing on the first try. Reshuffle io_rw_reissue() so
it's checked first.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: make percpu_ref_release names consistent
Bijan Mottahedeh [Fri, 15 Jan 2021 17:37:52 +0000 (17:37 +0000)]
io_uring: make percpu_ref_release names consistent

Make the percpu ref release function names consistent between rsrc data
and nodes.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: create common fixed_rsrc_data allocation routines
Bijan Mottahedeh [Fri, 15 Jan 2021 17:37:51 +0000 (17:37 +0000)]
io_uring: create common fixed_rsrc_data allocation routines

Create common alloc/free fixed_rsrc_data routines for both files and
buffers.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
[remove buffer part]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: create common fixed_rsrc_ref_node handling routines
Bijan Mottahedeh [Fri, 15 Jan 2021 17:37:50 +0000 (17:37 +0000)]
io_uring: create common fixed_rsrc_ref_node handling routines

Create common routines to be used for both files/buffers registration.

[remove io_sqe_rsrc_set_node substitution]

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
[merge, quiesce only for files]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoio_uring: split ref_node alloc and init
Pavel Begunkov [Fri, 15 Jan 2021 17:37:49 +0000 (17:37 +0000)]
io_uring: split ref_node alloc and init

A simple prep patch allowing to set refnode callbacks after it was
allocated. This needed to 1) keep ourself off of hi-level functions
where it's not pretty and they are not necessary 2) amortise ref_node
allocation in the future, e.g. for updates.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>