Mike Snitzer [Tue, 14 Feb 2023 18:06:05 +0000 (13:06 -0500)]
dm: remove flush_scheduled_work() during local_exit()
Commit
acfe0ad74d2e1 ("dm: allocate a special workqueue for deferred
device removal") switched from using system workqueue to a single
workqueue local to DM. But it didn't eliminate the call to
flush_scheduled_work() that was introduced purely for the benefit of
deferred device removal with commit
2c140a246dc ("dm: allow remove to
be deferred").
Since DM core uses its own workqueue (and queue_work) there is no need
to call flush_scheduled_work() from local_exit(). local_exit()'s
destroy_workqueue(deferred_remove_workqueue) handles flushing work
started with queue_work().
Fixes:
acfe0ad74d2e1 ("dm: allocate a special workqueue for deferred device removal")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Tue, 7 Feb 2023 22:29:02 +0000 (23:29 +0100)]
dm clone: prefer kvmalloc_array()
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Tue, 7 Feb 2023 22:15:36 +0000 (23:15 +0100)]
dm: declare variables static when sensible
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Tue, 7 Feb 2023 22:06:47 +0000 (23:06 +0100)]
dm: fix suspect indent whitespace
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Tue, 7 Feb 2023 22:04:01 +0000 (23:04 +0100)]
dm ioctl: prefer strscpy() instead of strlcpy()
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Tue, 7 Feb 2023 21:27:11 +0000 (22:27 +0100)]
dm: avoid void function return statements
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Tue, 7 Feb 2023 21:22:08 +0000 (22:22 +0100)]
dm integrity: change macros min/max() -> min_t/max_t where appropriate
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Tue, 7 Feb 2023 21:16:53 +0000 (22:16 +0100)]
dm: fix use of sizeof() macro
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Tue, 7 Feb 2023 21:07:22 +0000 (22:07 +0100)]
dm: avoid 'do {} while(0)' loop in single statement macros
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Tue, 7 Feb 2023 21:02:17 +0000 (22:02 +0100)]
dm log: avoid multiple line dereference
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Tue, 7 Feb 2023 21:00:44 +0000 (22:00 +0100)]
dm log: avoid trailing semicolon in macro
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Tue, 7 Feb 2023 20:47:45 +0000 (21:47 +0100)]
dm ioctl: have constant on the right side of the test
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Tue, 7 Feb 2023 20:02:51 +0000 (21:02 +0100)]
dm: don't indent labels
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Tue, 7 Feb 2023 19:56:57 +0000 (20:56 +0100)]
dm: avoid inline filenames
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Tue, 7 Feb 2023 19:48:51 +0000 (20:48 +0100)]
dm: add missing blank line after declarations/fix those
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Tue, 7 Feb 2023 19:35:25 +0000 (20:35 +0100)]
dm: avoid useless 'else' after 'break' or return'
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Tue, 7 Feb 2023 17:38:28 +0000 (18:38 +0100)]
dm: favour __packed versus "__attribute__ ((packed))"
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Tue, 7 Feb 2023 17:35:43 +0000 (18:35 +0100)]
dm: favour __aligned(N) versus "__attribute__ (aligned(N))"
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Mon, 6 Feb 2023 22:58:05 +0000 (23:58 +0100)]
dm: avoid using symbolic permissions
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Mon, 6 Feb 2023 22:42:32 +0000 (23:42 +0100)]
dm: prefer '"%s...", __func__'
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Fri, 3 Feb 2023 21:17:21 +0000 (22:17 +0100)]
dm: adjust EXPORT_SYMBOL() to follow functions immediately
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Fri, 3 Feb 2023 17:55:47 +0000 (18:55 +0100)]
dm: avoid split of quoted strings where possible
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Thu, 2 Feb 2023 16:10:52 +0000 (17:10 +0100)]
dm: remove unnecessary braces from single statement blocks
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Wed, 1 Feb 2023 22:42:29 +0000 (23:42 +0100)]
dm: add missing empty lines
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Wed, 1 Feb 2023 21:36:19 +0000 (22:36 +0100)]
dm: add argument identifier names
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Wed, 1 Feb 2023 21:31:43 +0000 (22:31 +0100)]
dm: avoid spaces before function arguments or in favour of tabs
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Wed, 1 Feb 2023 20:57:45 +0000 (21:57 +0100)]
dm block-manager: avoid not required parentheses
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Tue, 31 Jan 2023 18:47:36 +0000 (19:47 +0100)]
dm crypt: correct 'foo*' to 'foo *'
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Mon, 30 Jan 2023 21:13:54 +0000 (22:13 +0100)]
dm: fix trailing statements
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Mon, 30 Jan 2023 20:43:57 +0000 (21:43 +0100)]
dm: fix undue/missing spaces
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Thu, 26 Jan 2023 14:48:30 +0000 (15:48 +0100)]
dm: correct block comments format.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Wed, 25 Jan 2023 22:31:55 +0000 (23:31 +0100)]
dm: address indent/space issues
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Wed, 25 Jan 2023 21:57:42 +0000 (22:57 +0100)]
dm: address space issues relative to switch/while/for/...
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Mon, 30 Jan 2023 20:28:24 +0000 (21:28 +0100)]
dm: avoid initializing static variables
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Wed, 1 Feb 2023 20:51:04 +0000 (21:51 +0100)]
dm: enclose complex macros into parentheses where possible
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Wed, 1 Feb 2023 20:17:44 +0000 (21:17 +0100)]
dm: avoid assignment in if conditions
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Wed, 25 Jan 2023 20:14:58 +0000 (21:14 +0100)]
dm: change "unsigned" to "unsigned int"
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Wed, 25 Jan 2023 21:44:39 +0000 (22:44 +0100)]
dm: use fsleep() instead of msleep() for deterministic sleep duration
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Tue, 7 Feb 2023 19:22:58 +0000 (20:22 +0100)]
dm: prefer kmap_local_page() instead of deprecated kmap_atomic()
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Heinz Mauelshagen [Wed, 25 Jan 2023 20:00:44 +0000 (21:00 +0100)]
dm: add missing SPDX-License-Indentifiers
'GPL-2.0-only' is used instead of 'GPL-2.0' because SPDX has
deprecated its use.
Suggested-by: John Wiele <jwiele@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Mikulas Patocka [Tue, 7 Feb 2023 13:33:06 +0000 (08:33 -0500)]
dm: send just one event on resize, not two
Device mapper sends an uevent when the device is suspended, using the
function set_capacity_and_notify. However, this causes a race condition
with udev.
Udev skips scanning dm devices that are suspended. If we send an uevent
while we are suspended, udev will be racing with device mapper resume
code. If the device mapper resume code wins the race, udev will process
the uevent after the device is resumed and it will properly scan the
device.
However, if udev wins the race, it will receive the uevent, find out that
the dm device is suspended and skip scanning the device. This causes bugs
such as systemd unmounting the device - see
https://bugzilla.redhat.com/show_bug.cgi?id=2158628
This commit fixes this race.
We replace the function set_capacity_and_notify with set_capacity, so that
the uevent is not sent at this point. In do_resume, we detect if the
capacity has changed and we pass a boolean variable need_resize_uevent to
dm_kobject_uevent. dm_kobject_uevent adds "RESIZE=1" to the uevent if
need_resize_uevent is set.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Peter Rajnoha <prajnoha@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Benjamin Marzinski [Tue, 31 Jan 2023 21:22:57 +0000 (15:22 -0600)]
dm table: check that a dm device doesn't reference itself
If a DM device's table references itself, it will crash the kernel with an
infinite recursion. Check for a self-reference in dm_get_device(). This
is a quick check, but it won't catch more complicated circular references.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Yu Zhe [Mon, 6 Feb 2023 03:27:39 +0000 (11:27 +0800)]
dm raid: fix some spelling mistakes in comments
Signed-off-by: Yu Zhe <yuzhe@nfschina.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Nathan Huckleberry [Thu, 2 Feb 2023 01:23:48 +0000 (17:23 -0800)]
dm verity: stop using WQ_UNBOUND for verify_wq
Setting WQ_UNBOUND increases scheduler latency on ARM64. This is
likely due to the asymmetric architecture of ARM64 processors.
I've been unable to reproduce the results that claim WQ_UNBOUND gives
a performance boost on x86-64.
This flag is causing performance issues for multiple subsystems within
Android. Notably, the same slowdown exists for decompression with
EROFS.
| open-prebuilt-camera | WQ_UNBOUND | ~WQ_UNBOUND |
|-----------------------|------------|---------------|
| verity wait time (us) | 11746 | 119 (-98%) |
| erofs wait time (us) | 357805 | 174205 (-51%) |
| sha256 ramdisk random read | WQ_UNBOUND | ~WQ_UNBOUND |
|----------------------------|-----------=---|-------------|
| arm64 (accelerated) | bw=42.4MiB/s | bw=212MiB/s |
| arm64 (generic) | bw=16.5MiB/s | bw=48MiB/s |
| x86_64 (generic) | bw=233MiB/s | bw=230MiB/s |
Using a alloc_workqueue() @max_active arg of num_online_cpus() only
made sense with WQ_UNBOUND. Switch the @max_active arg to 0 (aka
default, which is 256 per-cpu).
Also, eliminate 'wq_flags' since it really doesn't serve a purpose.
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Nathan Huckleberry <nhuck@google.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Jiapeng Chong [Tue, 31 Jan 2023 06:09:41 +0000 (14:09 +0800)]
dm integrity: Remove bi_sector that's only used by commented debug code
drivers/md/dm-integrity.c:1738:13: warning: variable 'bi_sector' set but not used.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3895
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Christophe JAILLET [Sat, 28 Jan 2023 15:43:00 +0000 (16:43 +0100)]
dm crypt: Slightly simplify crypt_set_keyring_key()
Use strchr() instead of strpbrk() when there is only 1 element in the set
of characters to look for.
This potentially saves a few cycles, but gcc does already account for
optimizing this pattern thanks to it's fold_builtin_strpbrk().
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Sergey Shtylyov [Tue, 17 Jan 2023 20:59:55 +0000 (23:59 +0300)]
dm ioctl: drop always-false condition
The expression 'indata[3] > ULONG_MAX' always evaluates to false since
indata[] is declared as an array of *unsigned long* elements and #define
ULONG_MAX represents the max value of that exact type...
Note that gcc seems to be able to detect the dead code here and eliminate
this check anyway...
Found by Linux Verification Center (linuxtesting.org) with the SVACE static
analysis tool.
Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Mikulas Patocka [Sun, 22 Jan 2023 19:03:56 +0000 (14:03 -0500)]
dm flakey: fix logic when corrupting a bio
If "corrupt_bio_byte" is set to corrupt reads and corrupt_bio_flags is
used, dm-flakey would erroneously return all writes as errors. Likewise,
if "corrupt_bio_byte" is set to corrupt writes, dm-flakey would return
errors for all reads.
Fix the logic so that if fc->corrupt_bio_byte is non-zero, dm-flakey
will not abort reads on writes with an error.
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Mikulas Patocka [Sun, 22 Jan 2023 19:03:31 +0000 (14:03 -0500)]
dm flakey: fix a bug with 32-bit highmem systems
The function page_address does not work with 32-bit systems with high
memory. Use bvec_kmap_local/kunmap_local instead.
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Mikulas Patocka [Sun, 22 Jan 2023 19:02:57 +0000 (14:02 -0500)]
dm flakey: don't corrupt the zero page
When we need to zero some range on a block device, the function
__blkdev_issue_zero_pages submits a write bio with the bio vector pointing
to the zero page. If we use dm-flakey with corrupt bio writes option, it
will corrupt the content of the zero page which results in crashes of
various userspace programs. Glibc assumes that memory returned by mmap is
zeroed and it uses it for calloc implementation; if the newly mapped
memory is not zeroed, calloc will return non-zeroed memory.
Fix this bug by testing if the page is equal to ZERO_PAGE(0) and
avoiding the corruption in this case.
Cc: stable@vger.kernel.org
Fixes:
a00f5276e266 ("dm flakey: Properly corrupt multi-page bios.")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Joe Thornber [Thu, 26 Jan 2023 10:14:26 +0000 (10:14 +0000)]
dm cache: Add some documentation to dm-cache-background-tracker.h
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Joe Thornber [Thu, 26 Jan 2023 09:59:10 +0000 (09:59 +0000)]
dm cache: free background tracker's queued work in btracker_destroy
Otherwise the kernel can BUG with:
[ 2245.426978] =============================================================================
[ 2245.435155] BUG bt_work (Tainted: G B W ): Objects remaining in bt_work on __kmem_cache_shutdown()
[ 2245.445233] -----------------------------------------------------------------------------
[ 2245.445233]
[ 2245.454879] Slab 0x00000000b0ce2b30 objects=64 used=2 fp=0x000000000a3c6a4e flags=0x17ffffc0000200(slab|node=0|zone=2|lastcpupid=0x1fffff)
[ 2245.467300] CPU: 7 PID: 10805 Comm: lvm Kdump: loaded Tainted: G B W 6.0.0-rc2 #19
[ 2245.476078] Hardware name: Dell Inc. PowerEdge R7525/0590KW, BIOS 2.5.6 10/06/2021
[ 2245.483646] Call Trace:
[ 2245.486100] <TASK>
[ 2245.488206] dump_stack_lvl+0x34/0x48
[ 2245.491878] slab_err+0x95/0xcd
[ 2245.495028] __kmem_cache_shutdown.cold+0x31/0x136
[ 2245.499821] kmem_cache_destroy+0x49/0x130
[ 2245.503928] btracker_destroy+0x12/0x20 [dm_cache]
[ 2245.508728] smq_destroy+0x15/0x60 [dm_cache_smq]
[ 2245.513435] dm_cache_policy_destroy+0x12/0x20 [dm_cache]
[ 2245.518834] destroy+0xc0/0x110 [dm_cache]
[ 2245.522933] dm_table_destroy+0x5c/0x120 [dm_mod]
[ 2245.527649] __dm_destroy+0x10e/0x1c0 [dm_mod]
[ 2245.532102] dev_remove+0x117/0x190 [dm_mod]
[ 2245.536384] ctl_ioctl+0x1a2/0x290 [dm_mod]
[ 2245.540579] dm_ctl_ioctl+0xa/0x20 [dm_mod]
[ 2245.544773] __x64_sys_ioctl+0x8a/0xc0
[ 2245.548524] do_syscall_64+0x5c/0x90
[ 2245.552104] ? syscall_exit_to_user_mode+0x12/0x30
[ 2245.556897] ? do_syscall_64+0x69/0x90
[ 2245.560648] ? do_syscall_64+0x69/0x90
[ 2245.564394] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 2245.569447] RIP: 0033:0x7fe52583ec6b
...
[ 2245.646771] ------------[ cut here ]------------
[ 2245.651395] kmem_cache_destroy bt_work: Slab cache still has objects when called from btracker_destroy+0x12/0x20 [dm_cache]
[ 2245.651408] WARNING: CPU: 7 PID: 10805 at mm/slab_common.c:478 kmem_cache_destroy+0x128/0x130
Found using: lvm2-testsuite --only "cache-single-split.sh"
Ben bisected and found that commit
0495e337b703 ("mm/slab_common:
Deleting kobject in kmem_cache_destroy() without holding
slab_mutex/cpu_hotplug_lock") first exposed dm-cache's incomplete
cleanup of its background tracker work objects.
Reported-by: Benjamin Marzinski <bmarzins@redhat.com>
Tested-by: Benjamin Marzinski <bmarzins@redhat.com>
Cc: stable@vger.kernel.org # 6.0+
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Mike Snitzer [Wed, 25 Jan 2023 18:01:46 +0000 (13:01 -0500)]
dm: improve shrinker debug names
Commit
e33c267ab70d ("mm: shrinkers: provide shrinkers with names")
chose some fairly bad names for DM's shrinkers.
Fixes:
e33c267ab70d ("mm: shrinkers: provide shrinkers with names")
Signed-off-by : Mike Snitzer <snitzer@kernel.org>
Ulf Hansson [Mon, 30 Jan 2023 12:12:40 +0000 (13:12 +0100)]
block: Default to use cgroup support for BFQ
Assuming that both Kconfig options, BLK_CGROUP and IOSCHED_BFQ are set, we
most likely want cgroup support for BFQ too (BFQ_GROUP_IOSCHED), so let's
make it default y.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/20230130121240.159456-1-ulf.hansson@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 09:51:53 +0000 (17:51 +0800)]
block, bfq: remove unused bfq_wr_max_time in struct bfq_data
bfqd->bfq_wr_max_time is set to 0 in bfq_init_queue and is never changed.
It is only used in bfq_wr_duration when bfq_wr_max_time > 0 which never
meets, so bfqd->bfq_wr_max_time is not used actually. Just remove it.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-9-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 09:51:52 +0000 (17:51 +0800)]
block, bfq: remove unnecessary goto tag in bfq_dispatch_rq_from_bfqq
We jump to tag only for returning current rq. Return directly to
remove this tag.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116095153.3810101-8-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 09:51:51 +0000 (17:51 +0800)]
block, bfq: remove redundant check in bfq_put_cooperator
We have already avoided a circular list in bfq_setup_merge (see comments
in bfq_setup_merge() for details), so bfq_queue will not appear in it's
new_bfqq list. Just remove this check.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-7-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 09:51:50 +0000 (17:51 +0800)]
block, bfq: remove unnecessary dereference to get async_bfqq
The async_bfqq is assigned with bfqq->bic->bfqq[0], use it directly.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-6-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 09:51:49 +0000 (17:51 +0800)]
block, bfq: use helper macro RQ_BFQQ to get bfqq of request
Use helper macro RQ_BFQQ to get bfqq of request.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-5-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 09:51:48 +0000 (17:51 +0800)]
block, bfq: initialize bfqq->decrease_time_jif correctly
Inject limit is updated or reset when time_is_before_eq_jiffies(
decrease_time_jif + several msecs) or think-time state changes.
decrease_time_jif is initialized to 0 and will be set to current jiffies
when inject limit is updated or reset. If the jiffies is slightly greater
than LONG_MAX, time_is_after_eq_jiffies(0) will keep for a long time, so as
time_is_after_eq_jiffies(decrease_time_jif + several msecs). If the
think-time state never chages, then the injection will not work as expected
for long time.
To be more specific:
Function bfq_update_inject_limit maybe triggered when jiffies pasts
decrease_time_jif + msecs_to_jiffies(10) in bfq_add_request by setting
bfqd->wait_dispatch to true.
Function bfq_reset_inject_limit are called in two conditions:
1. jiffies pasts bfqq->decrease_time_jif + msecs_to_jiffies(1000) in
function bfq_add_request.
2. jiffies pasts bfqq->decrease_time_jif + msecs_to_jiffies(100) or
bfq think-time state change from short to long.
Fix this by initializing bfqq->decrease_time_jif to current jiffies
to trigger service injection soon when service injection conditions
are met.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-4-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 09:51:47 +0000 (17:51 +0800)]
block, bfq: remove unsed parameter reason in bfq_bfqq_is_slow
Parameter reason is never used, just remove it.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-3-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 09:51:46 +0000 (17:51 +0800)]
block, bfq: correctly raise inject limit in bfq_choose_bfqq_for_injection
Function bfq_choose_bfqq_for_injection may temporarily raise inject limit
to one request if current inject_limit is 0 before search of the source
queue for injection. However the search below will reset inject limit to
bfqd->in_service_queue which is zero for raised inject limit. Then the
temporarily raised inject limit never works as expected.
Assigment limit to bfqd->in_service_queue in search is needed as limit
maybe overwriten to min_t(unsigned int, 1, limit) for condition that
a large in-flight request is on non-rotational devices in found queue.
So we need to reset limit to bfqd->in_service_queue for normal case.
Actually, we have already make sure bfqd->rq_in_driver is < limit before
search, then
-Limit is >= 1 as bfqd->rq_in_driver is >= 0. Then min_t(unsigned int,
1, limit) is always 1. So we can simply check bfqd->rq_in_driver with
1 instead of result of min_t(unsigned int, 1, limit) for larget request in
non-rotational device case to avoid overwritting limit and the bug is gone.
-For normal case, we have already check bfqd->rq_in_driver is < limit,
so we can return found bfqq unconditionally to remove unncessary check.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-2-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 20:50:59 +0000 (04:50 +0800)]
sbitmap: correct wake_batch recalculation to avoid potential IO hung
Commit
180dccb0dba4f ("blk-mq: fix tag_get wait task can't be awakened")
mentioned that in case of shared tags, there could be just one real
active hctx(queue) because of lazy detection of tag idle. Then driver tag
allocation may wait forever on this real active hctx(queue) if wake_batch
is > hctx_max_depth where hctx_max_depth is available tags depth for the
actve hctx(queue). However, the condition wake_batch > hctx_max_depth is
not strong enough to avoid IO hung as the sbitmap_queue_wake_up will only
wake up one wait queue for each wake_batch even though there is only one
waiter in the woken wait queue. After this, there is only one tag to free
and wake_batch may not be reached anymore. Commit
180dccb0dba4f ("blk-mq:
fix tag_get wait task can't be awakened") methioned that driver tag
allocation may wait forever. Actually, the inactive hctx(queue) will be
truely idle after at most 30 seconds and will call blk_mq_tag_wakeup_all
to wake one waiter per wait queue to break the hung. But IO hung for 30
seconds is also not acceptable. Set batch size to small enough that depth
of the shared hctx(queue) is enough to wake up all of the queues like
sbq_calc_wake_batch do to fix this potential IO hung.
Although hctx_max_depth will be clamped to at least 4 while wake_batch
recalculation does not do the clamp, the wake_batch will be always
recalculated to 1 when hctx_max_depth <= 4.
Fixes:
180dccb0dba4 ("blk-mq: fix tag_get wait task can't be awakened")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-6-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 20:50:58 +0000 (04:50 +0800)]
sbitmap: add sbitmap_find_bit to remove repeat code in __sbitmap_get/__sbitmap_get_shallow
There are three differences between __sbitmap_get and
__sbitmap_get_shallow when searching free bit:
1. __sbitmap_get_shallow limit number of bit to search per word.
__sbitmap_get has no such limit.
2. __sbitmap_get_shallow always searches with wrap set. __sbitmap_get set
wrap according to round_robin.
3. __sbitmap_get_shallow always searches from first bit in first word.
__sbitmap_get searches from first bit when round_robin is not set
otherwise searches from SB_NR_TO_BIT(sb, alloc_hint).
Add helper function sbitmap_find_bit function to do common search while
accept "limit depth per word", "wrap flag" and "first bit to
search" from caller to support the need of both __sbitmap_get and
__sbitmap_get_shallow.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-5-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 20:50:57 +0000 (04:50 +0800)]
sbitmap: rewrite sbitmap_find_bit_in_index to reduce repeat code
Rewrite sbitmap_find_bit_in_index as following:
1. Rename sbitmap_find_bit_in_index to sbitmap_find_bit_in_word
2. Accept "struct sbitmap_word *" directly instead of accepting
"struct sbitmap *" and "int index" to get "struct sbitmap_word *".
3. Accept depth/shallow_depth and wrap for __sbitmap_get_word from caller
to support need of both __sbitmap_get_shallow and __sbitmap_get.
With helper function sbitmap_find_bit_in_word, we can remove repeat
code in __sbitmap_get_shallow to find bit considring deferred clear.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-4-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 20:50:56 +0000 (04:50 +0800)]
sbitmap: remove redundant check in __sbitmap_queue_get_batch
Commit
fbb564a557809 ("lib/sbitmap: Fix invalid loop in
__sbitmap_queue_get_batch()") mentioned that "Checking free bits when
setting the target bits. Otherwise, it may reuse the busying bits."
This commit add check to make sure all masked bits in word before
cmpxchg is zero. Then the existing check after cmpxchg to check any
zero bit is existing in masked bits in word is redundant.
Actually, old value of word before cmpxchg is stored in val and we
will filter out busy bits in val by "(get_mask & ~val)" after cmpxchg.
So we will not reuse busy bits methioned in commit
fbb564a557809
("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()"). Revert
new-added check to remove redundant check.
Fixes:
fbb564a55780 ("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-3-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Mon, 16 Jan 2023 20:50:55 +0000 (04:50 +0800)]
sbitmap: remove unnecessary calculation of alloc_hint in __sbitmap_get_shallow
Updates to alloc_hint in the loop in __sbitmap_get_shallow() are mostly
pointless and equivalent to setting alloc_hint to zero (because
SB_NR_TO_BIT() considers only low sb->shift bits from alloc_hint). So
simplify the logic.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-2-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Thu, 19 Jan 2023 11:03:50 +0000 (19:03 +0800)]
blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()
Currently parent pd can be freed before child pd:
t1: remove cgroup C1
blkcg_destroy_blkgs
blkg_destroy
list_del_init(&blkg->q_node)
// remove blkg from queue list
percpu_ref_kill(&blkg->refcnt)
blkg_release
call_rcu
t2: from t1
__blkg_release
blkg_free
schedule_work
t4: deactivate policy
blkcg_deactivate_policy
pd_free_fn
// parent of C1 is freed first
t3: from t2
blkg_free_workfn
pd_free_fn
If policy(for example, ioc_timer_fn() from iocost) access parent pd from
child pd after pd_offline_fn(), then UAF can be triggered.
Fix the problem by delaying 'list_del_init(&blkg->q_node)' from
blkg_destroy() to blkg_free_workfn(), and using a new disk level mutex to
synchronize blkg_free_workfn() and blkcg_deactivate_policy().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Thu, 19 Jan 2023 11:03:49 +0000 (19:03 +0800)]
blk-cgroup: support to track if policy is online
A new field 'online' is added to blkg_policy_data to fix following
2 problem:
1) In blkcg_activate_policy(), if pd_alloc_fn() with 'GFP_NOWAIT'
failed, 'queue_lock' will be dropped and pd_alloc_fn() will try again
without 'GFP_NOWAIT'. In the meantime, remove cgroup can race with
it, and pd_offline_fn() will be called without pd_init_fn() and
pd_online_fn(). This way null-ptr-deference can be triggered.
2) In order to synchronize pd_free_fn() from blkg_free_workfn() and
blkcg_deactivate_policy(), 'list_del_init(&blkg->q_node)' will be
delayed to blkg_free_workfn(), hence pd_offline_fn() can be called
first in blkg_destroy(), and then blkcg_deactivate_policy() will
call it again, we must prevent it.
The new field 'online' will be set after pd_online_fn() and will be
cleared after pd_offline_fn(), in the meantime pd_offline_fn() will only
be called if 'online' is set.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Thu, 19 Jan 2023 11:03:48 +0000 (19:03 +0800)]
blk-cgroup: dropping parent refcount after pd_free_fn() is done
Some cgroup policies will access parent pd through child pd even
after pd_offline_fn() is done. If pd_free_fn() for parent is called
before child, then UAF can be triggered. Hence it's better to guarantee
the order of pd_free_fn().
Currently refcount of parent blkg is dropped in __blkg_release(), which
is before pd_free_fn() is called in blkg_free_work_fn() while
blkg_free_work_fn() is called asynchronously.
This patch make sure pd_free_fn() called from removing cgroup is ordered
by delaying dropping parent refcount after calling pd_free_fn() for
child.
BTW, pd_free_fn() will also be called from blkcg_deactivate_policy()
from deleting device, and following patches will guarantee the order.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Zhong Jinghua [Sat, 28 Jan 2023 03:04:19 +0000 (11:04 +0800)]
blk-mq: cleanup unused methods: blk_mq_hw_sysfs_store
We found that the blk_mq_hw_sysfs_store interface has no place to use.
The object default_hw_ctx_attrs using blk_mq_hw_sysfs_ops only uses
the show method and does not use the store method.
Since this patch:
4a46f05ebf99 ("blk-mq: move hctx and ctx counters from sysfs to debugfs")
moved the store method to debugfs, the store method is not used anymore.
So let me do some tiny work to clean up unused code.
Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230128030419.2780298-1-zhongjinghua@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 23 Jan 2023 07:53:56 +0000 (08:53 +0100)]
s390/dcssblk:: don't call bio_split_to_limits
s390 iterates over the bio using bio_for_each_segment and doesn't need
any bio splitting.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/r/20230123075356.60847-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 23 Jan 2023 07:47:18 +0000 (08:47 +0100)]
ps3vram: remove bio splitting
ps3vram iterates over the bio one segment, that is page aligned and max
page sized chunk, a time. Because of that there is no point in
calling bio_split_to_limits, or explicitly setting the default limits
that are only used by bio_split_to_limits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Geoff Levand <geoff@infradead.org>
Link: https://lore.kernel.org/r/20230123074718.57951-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 20 Jan 2023 14:51:07 +0000 (07:51 -0700)]
block: treat poll queue enter similarly to timeouts
We ran into an issue where a production workload would randomly grind to
a halt and not continue until the pending IO had timed out. This turned
out to be a complicated interaction between queue freezing and polled
IO:
1) You have an application that does polled IO. At any point in time,
there may be polled IO pending.
2) You have a monitoring application that issues a passthrough command,
which is marked with side effects such that it needs to freeze the
queue.
3) Passthrough command is started, which calls blk_freeze_queue_start()
on the device. At this point the queue is marked frozen, and any
attempt to enter the queue will fail (for non-blocking) or block.
4) Now the driver calls blk_mq_freeze_queue_wait(), which will return
when the queue is quiesced and pending IO has completed.
5) The pending IO is polled IO, but any attempt to poll IO through the
normal iocb_bio_iopoll() -> bio_poll() will fail when it gets to
bio_queue_enter() as the queue is frozen. Rather than poll and
complete IO, the polling threads will sit in a tight loop attempting
to poll, but failing to enter the queue to do so.
The end result is that progress for either application will be stalled
until all pending polled IO has timed out. This causes obvious huge
latency issues for the application doing polled IO, but also long delays
for passthrough command.
Fix this by treating queue enter for polled IO just like we do for
timeouts. This allows quick quiesce of the queue as we still poll and
complete this IO, while still disallowing queueing up new IO.
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Li Nan [Tue, 17 Jan 2023 07:08:06 +0000 (15:08 +0800)]
blk-iocost: change div64_u64 to DIV64_U64_ROUND_UP in ioc_refresh_params()
vrate_min is calculated by DIV64_U64_ROUND_UP, but vrate_max is calculated
by div64_u64. Vrate_min may be 1 greater than vrate_max if the input
values min and max of cost.qos are equal.
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-6-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Li Nan [Tue, 17 Jan 2023 07:08:05 +0000 (15:08 +0800)]
blk-iocost: fix divide by 0 error in calc_lcoefs()
echo max of u64 to cost.model can cause divide by 0 error.
# echo 8:0 rbps=
18446744073709551615 > /sys/fs/cgroup/io.cost.model
divide error: 0000 [#1] PREEMPT SMP
RIP: 0010:calc_lcoefs+0x4c/0xc0
Call Trace:
<TASK>
ioc_refresh_params+0x2b3/0x4f0
ioc_cost_model_write+0x3cb/0x4c0
? _copy_from_iter+0x6d/0x6c0
? kernfs_fop_write_iter+0xfc/0x270
cgroup_file_write+0xa0/0x200
kernfs_fop_write_iter+0x17d/0x270
vfs_write+0x414/0x620
ksys_write+0x73/0x160
__x64_sys_write+0x1e/0x30
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
calc_lcoefs() uses the input value of cost.model in DIV_ROUND_UP_ULL,
overflow would happen if bps plus IOC_PAGE_SIZE is greater than
ULLONG_MAX, it can cause divide by 0 error.
Fix the problem by setting basecost
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Tue, 17 Jan 2023 07:08:04 +0000 (15:08 +0800)]
blk-iocost: read params inside lock in sysfs apis
Otherwise, user might get abnormal values if params is updated
concurrently.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Tue, 17 Jan 2023 07:08:03 +0000 (15:08 +0800)]
blk-iocost: don't allow to configure bio based device
iocost is based on rq_qos, which can only work for request based device,
thus it doesn't make sense to configure iocost for bio based device.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Tue, 17 Jan 2023 07:08:02 +0000 (15:08 +0800)]
blk-iocost: check return value of match_u64()
This patch fixs that the return value of match_u64() from ioc_qos_write()
is not checked,
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Arnd Bergmann [Wed, 18 Jan 2023 08:07:01 +0000 (09:07 +0100)]
blk-iocost: avoid 64-bit division in ioc_timer_fn
The behavior of 'enum' types has changed in gcc-13, so now the
UNBUSY_THR_PCT constant is interpreted as a 64-bit number because
it is defined as part of the same enum definition as some other
constants that do not fit within a 32-bit integer. This in turn
leads to some inefficient code on 32-bit architectures as well
as a link error:
arm-linux-gnueabi/bin/arm-linux-gnueabi-ld: block/blk-iocost.o: in function `ioc_timer_fn':
blk-iocost.c:(.text+0x68e8): undefined reference to `__aeabi_uldivmod'
arm-linux-gnueabi-ld: blk-iocost.c:(.text+0x6908): undefined reference to `__aeabi_uldivmod'
Split the enum definition to keep the 64-bit timing constants in
a separate enum type from those constants that can clearly fit
within a smaller type.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230118080706.3303186-1-arnd@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 18 Jan 2023 04:23:18 +0000 (12:23 +0800)]
block: ublk: fix doc build warning
Fix the following warning:
Documentation/block/ublk.rst:157: WARNING: Enumerated list ends without a blank line; unexpected unindent.
Documentation/block/ublk.rst:171: WARNING: Enumerated list ends without a blank line; unexpected unindent.
Fixes:
56f5160bc1b8 ("ublk_drv: add mechanism for supporting unprivileged ublk device")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230118042318.127900-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pankaj Raghav [Tue, 10 Jan 2023 14:36:35 +0000 (15:36 +0100)]
block: introduce bdev_zone_no helper
Add a generic bdev_zone_no() helper to calculate zone number for a
given sector in a block device. This helper internally uses disk_zone_no()
to find the zone number.
Use the helper bdev_zone_no() to calculate nr of zones. This lets us
make modifications to the math if needed in one place.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20230110143635.77300-4-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pankaj Raghav [Tue, 10 Jan 2023 14:36:34 +0000 (15:36 +0100)]
block: add a new helper bdev_{is_zone_start, offset_from_zone_start}
Instead of open coding to check for zone start, add a helper to improve
readability and store the logic in one place.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230110143635.77300-3-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pankaj Raghav [Tue, 10 Jan 2023 14:36:33 +0000 (15:36 +0100)]
block: remove superfluous check for request queue in bdev_is_zoned()
Remove the superfluous request queue check in bdev_is_zoned() as
bdev_get_queue() can never return NULL.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20230110143635.77300-2-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Anuj Gupta [Tue, 17 Jan 2023 12:06:38 +0000 (17:36 +0530)]
block: extend bio-cache for non-polled requests
This patch modifies the present check, so that bio-cache is not limited
to iopoll.
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20230117120638.72254-3-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Anuj Gupta [Tue, 17 Jan 2023 12:06:37 +0000 (17:36 +0530)]
nvme: set REQ_ALLOC_CACHE for uring-passthru request
This patch sets REQ_ALLOC_CACHE flag for uring-passthru requests.
This is a prep-patch so that normal / IRQ-driven uring-passthru
I/Os can also leverage bio-cache.
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20230117120638.72254-2-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 6 Jan 2023 04:17:11 +0000 (12:17 +0800)]
ublk_drv: add mechanism for supporting unprivileged ublk device
unprivileged ublk device is helpful for container use case, such
as: ublk device created in one unprivileged container can be controlled
and accessed by this container only.
Implement this feature by adding flag of UBLK_F_UNPRIVILEGED_DEV, and if
this flag isn't set, any control command has been run from privileged
user. Otherwise, any control command can be sent from any unprivileged
user, but the user has to be permitted to access the ublk char device
to be controlled.
In case of UBLK_F_UNPRIVILEGED_DEV:
1) for command UBLK_CMD_ADD_DEV, it is always allowed, and user needs
to provide owner's uid/gid in this command, so that udev can set correct
ownership for the created ublk device, since the device owner uid/gid
can be queried via command of UBLK_CMD_GET_DEV_INFO.
2) for other control commands, they can only be run successfully if the
current user is allowed to access the specified ublk char device, for
running the permission check, path of the ublk char device has to be
provided by these commands.
Also add one control of command UBLK_CMD_GET_DEV_INFO2 which always
include the char dev path in payload since userspace may not have
knowledge if this device is created in unprivileged mode.
For applying this mechanism, system administrator needs to take
the following policies:
1) chmod 0666 /dev/ublk-control
2) change ownership of ublkcN & ublkbN
- chown owner_uid:owner_gid /dev/ublkcN
- chown owner_uid:owner_gid /dev/ublkbN
Both can be done via one simple udev rule.
Userspace:
https://github.com/ming1/ubdsrv/tree/unprivileged-ublk
'ublk add -t $TYPE --un_privileged=1' is for creating one un-privileged
ublk device if the user is un-privileged.
Link: https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 6 Jan 2023 04:17:10 +0000 (12:17 +0800)]
ublk_drv: add module parameter of ublks_max for limiting max allowed ublk dev
Prepare for supporting unprivileged ublk device by limiting max number
ublk devices added. Otherwise too many ublk devices could be added by
un-trusted user, which can be thought as one DoS.
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 6 Jan 2023 04:17:09 +0000 (12:17 +0800)]
ublk_drv: add device parameter UBLK_PARAM_TYPE_DEVT
Userspace side only knows device ID, but the associated path of ublkc* and
ublkb* could be changed by udev, and that depends on userspace's policy, so
add parameter of UBLK_PARAM_TYPE_DEVT for retrieving major/minor of the
ublkc* and ublkb*, then user may figure out major/minor of the ublk disks
he/she owns. With major/minor, it is easy to find the device node path.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 6 Jan 2023 04:17:08 +0000 (12:17 +0800)]
ublk_drv: move ublk_get_device_from_id into ublk_ctrl_uring_cmd
It is annoying for each control command handler to get/put ublk
device and deal with failure.
Control command handler is simplified a lot by moving
ublk_get_device_from_id into ublk_ctrl_uring_cmd().
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 6 Jan 2023 04:17:07 +0000 (12:17 +0800)]
ublk_drv: don't probe partitions if the ubq daemon isn't trusted
If any ubq daemon is unprivileged, the ublk char device is allowed
for unprivileged user actually, and we can't trust the current user,
so not probe partitions.
Fixes:
71f28f3136af ("ublk_drv: add io_uring based userspace block driver")
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 6 Jan 2023 04:17:06 +0000 (12:17 +0800)]
ublk_drv: remove nr_aborted_queues from ublk_device
No one uses 'nr_aborted_queues' any more, so remove it.
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 16 Jan 2023 15:55:53 +0000 (08:55 -0700)]
block: don't allow multiple bios for IOCB_NOWAIT issue
If we're doing a large IO request which needs to be split into multiple
bios for issue, then we can run into the same situation as the below
marked commit fixes - parts will complete just fine, one or more parts
will fail to allocate a request. This will result in a partially
completed read or write request, where the caller gets EAGAIN even though
parts of the IO completed just fine.
Do the same for large bios as we do for splits - fail a NOWAIT request
with EAGAIN. This isn't technically fixing an issue in the below marked
patch, but for stable purposes, we should have either none of them or
both.
This depends on:
613b14884b85 ("block: handle bio_split_to_limits() NULL return")
Cc: stable@vger.kernel.org # 5.15+
Fixes:
9cea62b2cbab ("block: don't allow splitting of a REQ_NOWAIT bio")
Link: https://github.com/axboe/liburing/issues/766
Reported-and-tested-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Andreas Gruenbacher [Fri, 13 Jan 2023 12:35:38 +0000 (13:35 +0100)]
drbd: drbd_insert_interval(): Clarify comment
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20230113123538.144276-9-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lars Ellenberg [Fri, 13 Jan 2023 12:35:37 +0000 (13:35 +0100)]
drbd: interval tree: make removing an "empty" interval a no-op
Trying to remove an "empty" (just initialized, or "cleared") interval
from the tree, this results in an endless loop.
As we typically protect the tree with a spinlock_irq,
the result is a hung system.
Be nice to error cleanup code paths, ignore removal of empty intervals.
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20230113123538.144276-8-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Böhmwalder [Fri, 13 Jan 2023 12:35:36 +0000 (13:35 +0100)]
MAINTAINERS: add drbd headers
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Reviewed-by: Joel Colledge <joel.colledge@linbit.com>
Link: https://lore.kernel.org/r/20230113123538.144276-7-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Böhmwalder [Fri, 13 Jan 2023 12:35:35 +0000 (13:35 +0100)]
drbd: remove macros using require_context
This require_context attribute originated in a proposed sparse patch by
Philipp Reisner back in 2008. Johannes Berg had a different solution to
a similar problem, and that patch "won" in the end; so the require_context
thing never got merged. The whole history can be read at [0].
DRBD kept using these annotations anyway for a while. Nowadays, on a
modern unmodified sparse, they obviously do nothing, and they are hardly
used anymore anyway.
So, just remove the definitions of these macros.
[0] https://www.spinics.net/lists/linux-sparse/msg01150.html
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Reviewed-by: Joel Colledge <joel.colledge@linbit.com>
Link: https://lore.kernel.org/r/20230113123538.144276-6-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Böhmwalder [Fri, 13 Jan 2023 12:35:34 +0000 (13:35 +0100)]
drbd: remove unnecessary assignment in vli_encode_bits
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Reviewed-by: Joel Colledge <joel.colledge@linbit.com>
Link: https://lore.kernel.org/r/20230113123538.144276-5-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Böhmwalder [Fri, 13 Jan 2023 12:35:33 +0000 (13:35 +0100)]
drbd: make limits unsigned
These are almost always used as unsigned integers, so mark them as such.
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Reviewed-by: Joel Colledge <joel.colledge@linbit.com>
Link: https://lore.kernel.org/r/20230113123538.144276-4-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Robert Altnoeder [Fri, 13 Jan 2023 12:35:32 +0000 (13:35 +0100)]
drbd: fix DRBD_VOLUME_MAX 65535 -> 65534
The protocol uses -1 as a reserved value for
'no specific volume', and since the protocol field
is a 16 bit unsigned value, -1 is converted to
65535. Therefore, limit the range of valid volume
numbers to [0, 65534].
Signed-off-by: Robert Altnoeder <robert.altnoeder@linbit.com>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Reviewed-by: Joel Colledge <joel.colledge@linbit.com>
Link: https://lore.kernel.org/r/20230113123538.144276-3-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>