Vlastimil Babka [Mon, 21 Nov 2022 09:37:36 +0000 (10:37 +0100)]
Merge branch 'slab/for-6.2/alloc_size' into slab/for-next
Two patches from Kees Cook [1]:
These patches work around a deficiency in GCC (>=11) and Clang (<16)
where the __alloc_size attribute does not apply to inlines.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96503
This manifests as reduced overflow detection coverage for many allocation
sites under CONFIG_FORTIFY_SOURCE=y, where the allocation size was not
actually being propagated to __builtin_dynamic_object_size().
[1] https://lore.kernel.org/all/
20221118034713.gonna.754-kees@kernel.org/
Vlastimil Babka [Fri, 11 Nov 2022 08:08:18 +0000 (09:08 +0100)]
Merge branch 'slab/for-6.2/kmalloc_redzone' into slab/for-next
kmalloc() redzone improvements by Feng Tang
From cover letter [1]:
kmalloc's API family is critical for mm, and one of its nature is that
it will round up the request size to a fixed one (mostly power of 2).
When user requests memory for '2^n + 1' bytes, actually 2^(n+1) bytes
could be allocated, so there is an extra space than what is originally
requested.
This patchset tries to extend the redzone sanity check to the extra
kmalloced buffer than requested, to better detect un-legitimate access
to it. (depends on SLAB_STORE_USER & SLAB_RED_ZONE)
[1] https://lore.kernel.org/all/
20221021032405.1825078-1-feng.tang@intel.com/
Vlastimil Babka [Thu, 10 Nov 2022 09:42:34 +0000 (10:42 +0100)]
Merge branch 'slab/for-6.2/fit_rcu_head' into slab/for-next
A series by myself to reorder fields in struct slab to allow the
embedded rcu_head to grow (for debugging purposes). Requires changes to
isolate_movable_page() to skip slab pages which can otherwise become
false-positive __PageMovable due to its use of low bits in
page->mapping.
Vlastimil Babka [Thu, 10 Nov 2022 09:41:06 +0000 (10:41 +0100)]
Merge branch 'slab/for-6.2/tools' into slab/for-next
A patch for tools/vm/slabinfo to give more useful feedback when not run
as a root, by Rong Tao.
Vlastimil Babka [Thu, 10 Nov 2022 08:53:26 +0000 (09:53 +0100)]
Merge branch 'slab/for-6.2/slub-sysfs' into slab/for-next
- Two patches for SLUB's sysfs by Rasmus Villemoes to remove dead code
and optimize boot time with late initialization.
- Allow SLUB's sysfs 'failslab' parameter to be runtime-controllable
again as it can be both useful and safe, by Alexander Atanasov.
Vlastimil Babka [Thu, 10 Nov 2022 08:48:37 +0000 (09:48 +0100)]
Merge branch 'slab/for-6.2/locking' into slab/for-next
A patch from Jiri Kosina that makes SLAB's list_lock a raw_spinlock_t.
While there are no plans to make SLAB actually compatible with
PREEMPT_RT or any other future, it makes !PREEMPT_RT lockdep happy.
Vlastimil Babka [Thu, 10 Nov 2022 08:44:07 +0000 (09:44 +0100)]
Merge branch 'slab/for-6.2/cleanups' into slab/for-next
- Removal of dead code from deactivate_slab() by Hyeonggon Yoo.
- Fix of BUILD_BUG_ON() for sufficient early percpu size by Baoquan He.
- Make kmem_cache_alloc() kernel-doc less misleading, by myself.
Kees Cook [Fri, 18 Nov 2022 03:51:59 +0000 (19:51 -0800)]
slab: Remove special-casing of const 0 size allocations
Passing a constant-0 size allocation into kmalloc() or kmalloc_node()
does not need to be a fast-path operation, so the static return value
can be removed entirely. This makes sure that all paths through the
inlines result in a full extern function call, where __alloc_size()
hints will actually be seen[1] by GCC. (A constant return value of 0
means the "0" allocation size won't be propagated by the inline.)
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96503
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Kees Cook [Fri, 18 Nov 2022 03:51:58 +0000 (19:51 -0800)]
slab: Clean up SLOB vs kmalloc() definition
As already done for kmalloc_node(), clean up the #ifdef usage in the
definition of kmalloc() so that the SLOB-only version is an entirely
separate and much more readable function.
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Vlastimil Babka [Fri, 26 Aug 2022 09:09:12 +0000 (11:09 +0200)]
mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head
Joel reports [1] that increasing the rcu_head size for debugging
purposes used to work before struct slab was split from struct page, but
now runs into the various SLAB_MATCH() sanity checks of the layout.
This is because the rcu_head in struct page is in union with large
sub-structures and has space to grow without exceeding their size, while
in struct slab (for SLAB and SLUB) it's in union only with a list_head.
On closer inspection (and after the previous patch) we can put all
fields except slab_cache to a union with rcu_head, as slab_cache is
sufficient for the rcu freeing callbacks to work and the rest can be
overwritten by rcu_head without causing issues.
This is only somewhat complicated by the need to keep SLUB's
freelist+counters aligned for cmpxchg_double. As a result the fields
need to be reordered so that slab_cache is first (after page flags) and
the union with rcu_head follows. For consistency, do that for SLAB as
well, although not necessary there.
As a result, the rcu_head field in struct page and struct slab is no
longer at the same offset, but that doesn't matter as there is no
casting that would rely on that in the slab freeing callbacks, so we can
just drop the respective SLAB_MATCH() check.
Also we need to update the SLAB_MATCH() for compound_head to reflect the
new ordering.
While at it, also add a static_assert to check the alignment needed for
cmpxchg_double so mistakes are found sooner than a runtime GPF.
[1] https://lore.kernel.org/all/
85afd876-d8bb-0804-b2c5-
48ed3055e702@joelfernandes.org/
Reported-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Vlastimil Babka [Fri, 4 Nov 2022 14:57:26 +0000 (15:57 +0100)]
mm/migrate: make isolate_movable_page() skip slab pages
In the next commit we want to rearrange struct slab fields to allow a larger
rcu_head. Afterwards, the page->mapping field will overlap with SLUB's "struct
list_head slab_list", where the value of prev pointer can become LIST_POISON2,
which is 0x122 + POISON_POINTER_DELTA. Unfortunately the bit 1 being set can
confuse PageMovable() to be a false positive and cause a GPF as reported by lkp
[1].
To fix this, make isolate_movable_page() skip pages with the PageSlab flag set.
This is a bit tricky as we need to add memory barriers to SLAB and SLUB's page
allocation and freeing, and their counterparts to isolate_movable_page().
Based on my RFC from [2]. Added a comment update from Matthew's variant in [3]
and, as done there, moved the PageSlab checks to happen before trying to take
the page lock.
[1] https://lore.kernel.org/all/
208c1757-5edd-fd42-67d4-
1940cc43b50f@intel.com/
[2] https://lore.kernel.org/all/
aec59f53-0e53-1736-5932-
25407125d4d4@suse.cz/
[3] https://lore.kernel.org/all/YzsVM8eToHUeTP75@casper.infradead.org/
Reported-by: kernel test robot <yujie.liu@intel.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Vlastimil Babka [Thu, 10 Nov 2022 08:10:30 +0000 (09:10 +0100)]
mm/slab: move and adjust kernel-doc for kmem_cache_alloc
Alexander reports an issue with the kmem_cache_alloc() comment in
mm/slab.c:
> The current comment mentioned that the flags only matters if the
> cache has no available objects. It's different for the __GFP_ZERO
> flag which will ensure that the returned object is always zeroed
> in any case.
> I have the feeling I run into this question already two times if
> the user need to zero the object or not, but the user does not need
> to zero the object afterwards. However another use of __GFP_ZERO
> and only zero the object if the cache has no available objects would
> also make no sense.
and suggests thus mentioning __GFP_ZERO as the exception. But on closer
inspection, the part about flags being only relevant if cache has no
available objects is misleading. The slab user has no reliable way to
determine if there are available objects, and e.g. the might_sleep()
debug check can be performed even if objects are available, so passing
correct flags given the allocation context always matters.
Thus remove that sentence completely, and while at it, move the comment
to from SLAB-specific mm/slab.c to the common include/linux/slab.h
The comment otherwise refers flags description for kmalloc(), so add
__GFP_ZERO comment there and remove a very misleading GFP_HIGHUSER
(not applicable to slab) description from there. Mention kzalloc() and
kmem_cache_zalloc() shortcuts.
Reported-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/all/20221011145413.8025-1-aahringo@redhat.com/
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Baoquan He [Mon, 24 Oct 2022 08:14:35 +0000 (16:14 +0800)]
mm/slub, percpu: correct the calculation of early percpu allocation size
SLUB allocator relies on percpu allocator to initialize its ->cpu_slab
during early boot. For that, the dynamic chunk of percpu which serves
the early allocation need be large enough to satisfy the kmalloc
creation.
However, the current BUILD_BUG_ON() in alloc_kmem_cache_cpus() doesn't
consider the kmalloc array with NR_KMALLOC_TYPES length. Fix that
with correct calculation.
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Baoquan He [Sun, 13 Nov 2022 10:08:27 +0000 (18:08 +0800)]
percpu: adjust the value of PERCPU_DYNAMIC_EARLY_SIZE
LKP reported a build failure as below on the following patch "mm/slub,
percpu: correct the calculation of early percpu allocation size"
~~~~~~
In file included from <command-line>:
In function 'alloc_kmem_cache_cpus',
inlined from 'kmem_cache_open' at mm/slub.c:4340:6:
>> >> include/linux/compiler_types.h:357:45: error: call to '__compiletime_assert_474' declared with attribute error:
BUILD_BUG_ON failed: PERCPU_DYNAMIC_EARLY_SIZE < NR_KMALLOC_TYPES * KMALLOC_SHIFT_HIGH * sizeof(struct kmem_cache_cpu)
357 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
~~~~~~
From the kernel config file provided by LKP, the building was made on
arm64 with below Kconfig item enabled:
CONFIG_ZONE_DMA=y
CONFIG_SLUB_CPU_PARTIAL=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_SLUB_STATS=y
CONFIG_ARM64_PAGE_SHIFT=16
CONFIG_ARM64_64K_PAGES=y
Then we will have:
NR_KMALLOC_TYPES:4
KMALLOC_SHIFT_HIGH:17
sizeof(struct kmem_cache_cpu):184
The product of them is 12512, which is bigger than PERCPU_DYNAMIC_EARLY_SIZE,
12K. Hence, the BUILD_BUG_ON in alloc_kmem_cache_cpus() is triggered.
Earlier, in commit
099a19d91ca4 ("percpu: allow limited allocation
before slab is online"), PERCPU_DYNAMIC_EARLY_SIZE was introduced and
set to 12K which is equal to the then PERPCU_DYNAMIC_RESERVE.
Later, in commit
1a4d76076cda ("percpu: implement asynchronous chunk
population"), PERPCU_DYNAMIC_RESERVE was increased by 8K, while
PERCPU_DYNAMIC_EARLY_SIZE was kept unchanged.
So, here increase PERCPU_DYNAMIC_EARLY_SIZE by 8K too to accommodate to
the slub's requirement.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Feng Tang [Fri, 21 Oct 2022 03:24:05 +0000 (11:24 +0800)]
mm/slub: extend redzone check to extra allocated kmalloc space than requested
kmalloc will round up the request size to a fixed size (mostly power
of 2), so there could be a extra space than what is requested, whose
size is the actual buffer size minus original request size.
To better detect out of bound access or abuse of this space, add
redzone sanity check for it.
In current kernel, some kmalloc user already knows the existence of
the space and utilizes it after calling 'ksize()' to know the real
size of the allocated buffer. So we skip the sanity check for objects
which have been called with ksize(), as treating them as legitimate
users. Kees Cook is working on sanitizing all these user cases,
by using kmalloc_size_roundup() to avoid ambiguous usages. And after
this is done, this special handling for ksize() can be removed.
In some cases, the free pointer could be saved inside the latter
part of object data area, which may overlap the redzone part(for
small sizes of kmalloc objects). As suggested by Hyeonggon Yoo,
force the free pointer to be in meta data area when kmalloc redzone
debug is enabled, to make all kmalloc objects covered by redzone
check.
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Feng Tang [Fri, 21 Oct 2022 03:24:04 +0000 (11:24 +0800)]
mm: kasan: Extend kasan_metadata_size() to also cover in-object size
When kasan is enabled for slab/slub, it may save kasan' free_meta
data in the former part of slab object data area in slab object's
free path, which works fine.
There is ongoing effort to extend slub's debug function which will
redzone the latter part of kmalloc object area, and when both of
the debug are enabled, there is possible conflict, especially when
the kmalloc object has small size, as caught by 0Day bot [1].
To solve it, slub code needs to know the in-object kasan's meta
data size. Currently, there is existing kasan_metadata_size()
which returns the kasan's metadata size inside slub's metadata
area, so extend it to also cover the in-object meta size by
adding a boolean flag 'in_object'.
There is no functional change to existing code logic.
[1]. https://lore.kernel.org/lkml/YuYm3dWwpZwH58Hu@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Suggested-by: Andrey Konovalov <andreyknvl@gmail.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Feng Tang [Fri, 21 Oct 2022 03:24:03 +0000 (11:24 +0800)]
mm/slub: only zero requested size of buffer for kzalloc when debug enabled
kzalloc/kmalloc will round up the request size to a fixed size
(mostly power of 2), so the allocated memory could be more than
requested. Currently kzalloc family APIs will zero all the
allocated memory.
To detect out-of-bound usage of the extra allocated memory, only
zero the requested part, so that redzone sanity check could be
added to the extra space later.
For kzalloc users who will call ksize() later and utilize this
extra space, please be aware that the space is not zeroed any
more when debug is enabled. (Thanks to Kees Cook's effort to
sanitize all ksize() user cases [1], this won't be a big issue).
[1]. https://lore.kernel.org/all/
20220922031013.2150682-1-keescook@chromium.org/#r
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Rong Tao [Tue, 8 Nov 2022 12:47:57 +0000 (20:47 +0800)]
tools/vm/slabinfo: indicates the cause of the EACCES error
If you don't run slabinfo with a superuser, return 0 when read_slab_dir()
reads get_obj_and_str("slabs", &t), because fopen() fails (sometimes
EACCES), causing slabcache() to return directly, without any error during
this time, we should tell the user about the EACCES problem instead of
running successfully($?=0) without any error printing.
For example:
$ ./slabinfo
Permission denied, Try using superuser <== What this submission did
$ sudo ./slabinfo
Name Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg
Acpi-Namespace 5950 48 286.7K 65/0/5 85 0 0 99
Acpi-Operand 13664 72 999.4K 231/0/13 56 0 0 98
...
Signed-off-by: Rong Tao <rongtao@cestc.cn>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Vlastimil Babka [Mon, 7 Nov 2022 16:11:27 +0000 (17:11 +0100)]
mm, slab: remove duplicate kernel-doc comment for ksize()
Akira reports:
> "make htmldocs" reports duplicate C declaration of ksize() as follows:
> /linux/Documentation/core-api/mm-api:43: ./mm/slab_common.c:1428: WARNING: Duplicate C declaration, also defined at core-api/mm-api:212.
> Declaration is '.. c:function:: size_t ksize (const void *objp)'.
> This is due to the kernel-doc comment for ksize() declaration added in
> include/linux/slab.h by commit
05a940656e1e ("slab: Introduce
> kmalloc_size_roundup()").
There is an older kernel-doc comment for ksize() definition in
mm/slab_common.c, which is not only duplicated, but also contradicts the
new one - the additional storage discovered by ksize() should not be
used by callers anymore. Delete the old kernel-doc.
Reported-by: Akira Yokosawa <akiyks@gmail.com>
Link: https://lore.kernel.org/all/d33440f6-40cf-9747-3340-e54ffaf7afb8@gmail.com/
Fixes:
05a940656e1e ("slab: Introduce kmalloc_size_roundup()")
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Kees Cook [Sat, 5 Nov 2022 06:35:34 +0000 (23:35 -0700)]
mm/slab_common: Restore passing "caller" for tracing
The "caller" argument was accidentally being ignored in a few places
that were recently refactored. Restore these "caller" arguments, instead
of _RET_IP_.
Fixes:
11e9734bcb6a ("mm/slab_common: unify NUMA and UMA version of tracepoints")
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Vlastimil Babka [Fri, 4 Nov 2022 12:57:11 +0000 (13:57 +0100)]
mm/slab: remove !CONFIG_TRACING variants of kmalloc_[node_]trace()
For !CONFIG_TRACING kernels, the kmalloc() implementation tries (in cases where
the allocation size is build-time constant) to save a function call, by
inlining kmalloc_trace() to a kmem_cache_alloc() call.
However since commit
6edf2576a6cc ("mm/slub: enable debugging memory wasting of
kmalloc") this path now fails to pass the original request size to be
eventually recorded (for kmalloc caches with debugging enabled).
We could adjust the code to call __kmem_cache_alloc_node() as the
CONFIG_TRACING variant, but that would as a result inline a call with 5
parameters, bloating the kmalloc() call sites. The cost of extra function
call (to kmalloc_trace()) seems like a lesser evil.
It also appears that the !CONFIG_TRACING variant is incompatible with upcoming
hardening efforts [1] so it's easier if we just remove it now. Kernels with no
tracing are rare these days and the benefit is dubious anyway.
[1] https://lore.kernel.org/linux-mm/
20221101222520.never.109-kees@kernel.org/T/#m20ecf14390e406247bde0ea9cce368f469c539ed
Link: https://lore.kernel.org/all/097d8fba-bd10-a312-24a3-a4068c4f424c@suse.cz/
Suggested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Lukas Bulwahn [Mon, 31 Oct 2022 09:29:20 +0000 (10:29 +0100)]
mm/slab_common: repair kernel-doc for __ksize()
Commit
445d41d7a7c1 ("Merge branch 'slab/for-6.1/kmalloc_size_roundup' into
slab/for-next") resolved a conflict of two concurrent changes to __ksize().
However, it did not adjust the kernel-doc comment of __ksize(), while the
name of the argument to __ksize() was renamed.
Hence, ./scripts/ kernel-doc -none mm/slab_common.c warns about it.
Adjust the kernel-doc comment for __ksize() for make W=1 happiness.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Vlastimil Babka [Fri, 26 Aug 2022 09:09:11 +0000 (11:09 +0200)]
mm/slub: perform free consistency checks before call_rcu
For SLAB_TYPESAFE_BY_RCU caches we use call_rcu to perform empty slab
freeing. The rcu callback rcu_free_slab() calls __free_slab() that
currently includes checking the slab consistency for caches with
SLAB_CONSISTENCY_CHECKS flags. This check needs the slab->objects field
to be intact.
Because in the next patch we want to allow rcu_head in struct slab to
become larger in debug configurations and thus potentially overwrite
more fields through a union than slab_list, we want to limit the fields
used in rcu_free_slab(). Thus move the consistency checks to
free_slab() before call_rcu(). This can be done safely even for
SLAB_TYPESAFE_BY_RCU caches where accesses to the objects can still
occur after freeing them.
As a result, only the slab->slab_cache field has to be physically
separate from rcu_head for the freeing callback to work. We also save
some cycles in the rcu callback for caches with consistency checks
enabled.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Jiri Kosina [Fri, 21 Oct 2022 19:18:12 +0000 (21:18 +0200)]
mm/slab: Annotate kmem_cache_node->list_lock as raw
The list_lock can be taken in hardirq context when do_drain() is being
called via IPI on all cores, and therefore lockdep complains about it,
because it can't be preempted on PREEMPT_RT.
That's not a real issue, as SLAB can't be built on PREEMPT_RT anyway, but
we still want to get rid of the warning on non-PREEMPT_RT builds.
Annotate it therefore as a raw lock in order to get rid of he lockdep
warning below.
=============================
[ BUG: Invalid wait context ]
6.1.0-rc1-00134-ge35184f32151 #4 Not tainted
-----------------------------
swapper/3/0 is trying to lock:
ffff8bc88086dc18 (&parent->list_lock){..-.}-{3:3}, at: do_drain+0x57/0xb0
other info that might help us debug this:
context-{2:2}
no locks held by swapper/3/0.
stack backtrace:
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.1.0-rc1-00134-ge35184f32151 #4
Hardware name: LENOVO 20K5S22R00/20K5S22R00, BIOS R0IET38W (1.16 ) 05/31/2017
Call Trace:
<IRQ>
dump_stack_lvl+0x6b/0x9d
__lock_acquire+0x1519/0x1730
? build_sched_domains+0x4bd/0x1590
? __lock_acquire+0xad2/0x1730
lock_acquire+0x294/0x340
? do_drain+0x57/0xb0
? sched_clock_tick+0x41/0x60
_raw_spin_lock+0x2c/0x40
? do_drain+0x57/0xb0
do_drain+0x57/0xb0
__flush_smp_call_function_queue+0x138/0x220
__sysvec_call_function+0x4f/0x210
sysvec_call_function+0x4b/0x90
</IRQ>
<TASK>
asm_sysvec_call_function+0x16/0x20
RIP: 0010:mwait_idle+0x5e/0x80
Code: 31 d2 65 48 8b 04 25 80 ed 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 0b 78 46 00 31 c0 48 89 c1 fb 0f 01 c9 <eb> 06 fb 0f 1f 44 00 00 65 48 8b 04 25 80 ed 01 00 f0 80 60 02 df
RSP: 0000:
ffffa90940217ee0 EFLAGS:
00000246
RAX:
0000000000000000 RBX:
0000000000000000 RCX:
0000000000000000
RDX:
0000000000000000 RSI:
0000000000000000 RDI:
ffffffff9bb9f93a
RBP:
0000000000000003 R08:
0000000000000001 R09:
0000000000000001
R10:
ffffa90940217ea8 R11:
0000000000000000 R12:
ffffffffffffffff
R13:
0000000000000000 R14:
ffff8bc88127c500 R15:
0000000000000000
? default_idle_call+0x1a/0xa0
default_idle_call+0x4b/0xa0
do_idle+0x1f1/0x2c0
? _raw_spin_unlock_irqrestore+0x56/0x70
cpu_startup_entry+0x19/0x20
start_secondary+0x122/0x150
secondary_startup_64_no_verify+0xce/0xdb
</TASK>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Hyeonggon Yoo [Fri, 14 Oct 2022 11:43:22 +0000 (20:43 +0900)]
mm/slub: remove dead code for debug caches on deactivate_slab()
After commit
c7323a5ad0786 ("mm/slub: restrict sysfs validation to debug
caches and make it safe"), SLUB never installs percpu slab for debug caches
and thus never deactivates percpu slab for them.
Since only debug caches use the full list, SLUB no longer deactivates to
full list. Remove dead code in deactivate_slab().
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Alexander Atanasov [Tue, 20 Sep 2022 12:11:11 +0000 (15:11 +0300)]
mm: Make failslab writable again
In (
060807f841ac mm, slub: make remaining slub_debug related attributes
read-only) failslab was made read-only.
I think it became a collateral victim to the two other options for which
the reasons are perfectly valid.
Here is why:
- sanity_checks and trace are slab internal debug options,
failslab is used for fault injection.
- for fault injections, which by presumption are random, it
does not matter if it is not set atomically. And you need to
set atleast one more option to trigger fault injection.
- in a testing scenario you may need to change it at runtime
example: module loading - you test all allocations limited
by the space option. Then you move to test only your module's
own slabs.
- when set by command line flags it effectively disables all
cache merges.
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Jann Horn <jannh@google.com>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Link: http://lkml.kernel.org/r/20200610163135.17364-5-vbabka@suse.cz
Signed-off-by: Alexander Atanasov <alexander.atanasov@virtuozzo.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Rasmus Villemoes [Fri, 30 Sep 2022 10:27:12 +0000 (12:27 +0200)]
mm: slub: make slab_sysfs_init() a late_initcall
Currently, slab_sysfs_init() is an __initcall aka device_initcall. It
is rather time-consuming; on my board it takes around 11ms. That's
about 1% of the time budget I have from U-Boot letting go and until
linux must assume responsibility of keeping the external watchdog
happy.
There's no particular reason this would need to run at device_initcall
time, so instead make it a late_initcall to allow vital functionality
to get started a bit sooner.
This actually ends up winning more than just those 11ms, because the
slab caches that get created during other device_initcalls (and before
my watchdog device gets probed) now don't end up doing the somewhat
expensive sysfs_slab_add() themselves. Some example lines (with
initcall_debug set) before/after:
initcall ext4_init_fs+0x0/0x1ac returned 0 after 1386 usecs
initcall journal_init+0x0/0x138 returned 0 after 517 usecs
initcall init_fat_fs+0x0/0x68 returned 0 after 294 usecs
initcall ext4_init_fs+0x0/0x1ac returned 0 after 240 usecs
initcall journal_init+0x0/0x138 returned 0 after 32 usecs
initcall init_fat_fs+0x0/0x68 returned 0 after 18 usecs
Altogether, this means I now get to petting the watchdog around 17ms
sooner. [Of course, the time the other initcalls save is instead spent
in slab_sysfs_init(), which goes from 11ms to 16ms, so there's no
overall change in boot time.]
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Rasmus Villemoes [Fri, 30 Sep 2022 08:47:42 +0000 (10:47 +0200)]
mm: slub: remove dead and buggy code from sysfs_slab_add()
The function sysfs_slab_add() has two callers:
One is slab_sysfs_init(), which first initializes slab_kset, and only
when that succeeds sets slab_state to FULL, and then proceeds to call
sysfs_slab_add() for all previously created slabs.
The other is __kmem_cache_create(), but only after a
if (slab_state <= UP)
return 0;
check.
So in other words, sysfs_slab_add() is never called without
slab_kset (aka the return value of cache_kset()) being non-NULL.
And this is just as well, because if we ever did take this path and
called kobject_init(&s->kobj), and then later when called again from
slab_sysfs_init() would end up calling kobject_init_and_add(), we
would hit
if (kobj->state_initialized) {
/* do not error out as sometimes we can recover */
pr_err("kobject (%p): tried to init an initialized object, something is seriously wrong.\n",
dump_stack();
}
in kobject.c.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Linus Torvalds [Sun, 23 Oct 2022 22:27:33 +0000 (15:27 -0700)]
Linux 6.1-rc2
Linus Torvalds [Sun, 23 Oct 2022 22:00:43 +0000 (15:00 -0700)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"RISC-V:
- Fix compilation without RISCV_ISA_ZICBOM
- Fix kvm_riscv_vcpu_timer_pending() for Sstc
ARM:
- Fix a bug preventing restoring an ITS containing mappings for very
large and very sparse device topology
- Work around a relocation handling error when compiling the nVHE
object with profile optimisation
- Fix for stage-2 invalidation holding the VM MMU lock for too long
by limiting the walk to the largest block mapping size
- Enable stack protection and branch profiling for VHE
- Two selftest fixes
x86:
- add compat implementation for KVM_X86_SET_MSR_FILTER ioctl
selftests:
- synchronize includes between include/uapi and tools/include/uapi"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
tools: include: sync include/api/linux/kvm.h
KVM: x86: Add compat handler for KVM_X86_SET_MSR_FILTER
KVM: x86: Copy filter arg outside kvm_vm_ioctl_set_msr_filter()
kvm: Add support for arch compat vm ioctls
RISC-V: KVM: Fix kvm_riscv_vcpu_timer_pending() for Sstc
RISC-V: Fix compilation without RISCV_ISA_ZICBOM
KVM: arm64: vgic: Fix exit condition in scan_its_table()
KVM: arm64: nvhe: Fix build with profile optimization
KVM: selftests: Fix number of pages for memory slot in memslot_modification_stress_test
KVM: arm64: selftests: Fix multiple versions of GIC creation
KVM: arm64: Enable stack protection and branch profiling for VHE
KVM: arm64: Limit stage2_apply_range() batch size to largest block
KVM: arm64: Work out supported block level at compile time
Jason A. Donenfeld [Sat, 8 Oct 2022 15:47:00 +0000 (09:47 -0600)]
Revert "mfd: syscon: Remove repetition of the regmap_get_val_endian()"
This reverts commit
72a95859728a7866522e6633818bebc1c2519b17.
It broke reboots on big-endian MIPS and MIPS64 malta QEMU instances,
which use the syscon driver. Little-endian is not effected, which means
likely it's important to handle regmap_get_val_endian() in this function
after all.
Fixes:
72a95859728a ("mfd: syscon: Remove repetition of the regmap_get_val_endian()")
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Lee Jones <lee@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 23 Oct 2022 19:01:01 +0000 (12:01 -0700)]
kernel/utsname_sysctl.c: Fix hostname polling
Commit
bfca3dd3d068 ("kernel/utsname_sysctl.c: print kernel arch") added
a new entry to the uts_kern_table[] array, but didn't update the
UTS_PROC_xyz enumerators of older entries, breaking anything that used
them.
Which is admittedly not many cases: it's really just the two uses of
uts_proc_notify() in kernel/sys.c. But apparently journald-systemd
actually uses this to detect hostname changes.
Reported-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Fixes:
bfca3dd3d068 ("kernel/utsname_sysctl.c: print kernel arch")
Link: https://lore.kernel.org/lkml/0c2b92a6-0f25-9538-178f-eee3b06da23f@secunet.com/
Link: https://linux-regtracking.leemhuis.info/regzbot/regression/0c2b92a6-0f25-9538-178f-eee3b06da23f@secunet.com/
Cc: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 23 Oct 2022 17:14:45 +0000 (10:14 -0700)]
Merge tag 'perf_urgent_for_v6.1_rc2' of git://git./linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Fix raw data handling when perf events are used in bpf
- Rework how SIGTRAPs get delivered to events to address a bunch of
problems with it. Add a selftest for that too
* tag 'perf_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
bpf: Fix sample_flags for bpf_perf_event_output
selftests/perf_events: Add a SIGTRAP stress test with disables
perf: Fix missing SIGTRAPs
Linus Torvalds [Sun, 23 Oct 2022 17:10:55 +0000 (10:10 -0700)]
Merge tag 'sched_urgent_for_v6.1_rc2' of git://git./linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Adjust code to not trip up CFI
- Fix sched group cookie matching
* tag 'sched_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Introduce struct balance_callback to avoid CFI mismatches
sched/core: Fix comparison in sched_group_cookie_match()
Linus Torvalds [Sun, 23 Oct 2022 17:07:01 +0000 (10:07 -0700)]
Merge tag 'objtool_urgent_for_v6.1_rc2' of git://git./linux/kernel/git/tip/tip
Pull objtool fix from Borislav Petkov:
- Fix ORC stack unwinding when GCOV is enabled
* tag 'objtool_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/unwind/orc: Fix unreliable stack dump with gcov
Linus Torvalds [Sun, 23 Oct 2022 17:01:34 +0000 (10:01 -0700)]
Merge tag 'x86_urgent_for_v6.0_rc2' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
"As usually the case, right after a major release, the tip urgent
branches accumulate a couple more fixes than normal. And here is the
x86, a bit bigger, urgent pile.
- Use the correct CPU capability clearing function on the error path
in Intel perf LBR
- A CFI fix to ftrace along with a simplification
- Adjust handling of zero capacity bit mask for resctrl cache
allocation on AMD
- A fix to the AMD microcode loader to attempt patch application on
every logical thread
- A couple of topology fixes to handle CPUID leaf 0x1f enumeration
info properly
- Drop a -mabi=ms compiler option check as both compilers support it
now anyway
- A couple of fixes to how the initial, statically allocated FPU
buffer state is setup and its interaction with dynamic states at
runtime"
* tag 'x86_urgent_for_v6.0_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Fix copy_xstate_to_uabi() to copy init states correctly
perf/x86/intel/lbr: Use setup_clear_cpu_cap() instead of clear_cpu_cap()
ftrace,kcfi: Separate ftrace_stub() and ftrace_stub_graph()
x86/ftrace: Remove ftrace_epilogue()
x86/resctrl: Fix min_cbm_bits for AMD
x86/microcode/AMD: Apply the patch early on every logical thread
x86/topology: Fix duplicated core ID within a package
x86/topology: Fix multiple packages shown on a single-package system
hwmon/coretemp: Handle large core ID value
x86/Kconfig: Drop check for -mabi=ms for CONFIG_EFI_STUB
x86/fpu: Exclude dynamic states from init_fpstate
x86/fpu: Fix the init_fpstate size check with the actual size
x86/fpu: Configure init_fpstate attributes orderly
Linus Torvalds [Sun, 23 Oct 2022 16:55:50 +0000 (09:55 -0700)]
Merge tag 'io_uring-6.1-2022-10-22' of git://git.kernel.dk/linux
Pull io_uring follow-up from Jens Axboe:
"Currently the zero-copy has automatic fallback to normal transmit, and
it was decided that it'd be cleaner to return an error instead if the
socket type doesn't support it.
Zero-copy does work with UDP and TCP, it's more of a future proofing
kind of thing (eg for samba)"
* tag 'io_uring-6.1-2022-10-22' of git://git.kernel.dk/linux:
io_uring/net: fail zc sendmsg when unsupported by socket
io_uring/net: fail zc send when unsupported by socket
net: flag sockets supporting msghdr originated zerocopy
Linus Torvalds [Sat, 22 Oct 2022 23:04:34 +0000 (16:04 -0700)]
Merge tag 'hwmon-for-v6.1-rc2' of git://git./linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:
- corsair-psu: Fix typo in USB id description, and add USB ID for new
PSU
- pwm-fan: Fix fan power handling when disabling fan control
* tag 'hwmon-for-v6.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (corsair-psu) Add USB id of the new HX1500i psu
hwmon: (pwm-fan) Explicitly switch off fan power when setting pwm1_enable to 0
hwmon: (corsair-psu) fix typo in USB id description
Linus Torvalds [Sat, 22 Oct 2022 22:59:46 +0000 (15:59 -0700)]
Merge tag 'i2c-for-6.1-rc2' of git://git./linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"RPM fix for qcom-cci, platform module alias for xiic, build warning
fix for mlxbf, typo fixes in comments"
* tag 'i2c-for-6.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: mlxbf: depend on ACPI; clean away ifdeffage
i2c: fix spelling typos in comments
i2c: qcom-cci: Fix ordering of pm_runtime_xx and i2c_add_adapter
i2c: xiic: Add platform module alias
Linus Torvalds [Sat, 22 Oct 2022 22:52:36 +0000 (15:52 -0700)]
Merge tag 'pci-v6.1-fixes-2' of git://git./linux/kernel/git/helgaas/pci
Pull pci fixes from Bjorn Helgaas:
- Revert a simplification that broke pci-tegra due to a masking error
- Update MAINTAINERS for Kishon's email address change and TI
DRA7XX/J721E maintainer change
* tag 'pci-v6.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
MAINTAINERS: Update Kishon's email address in PCI endpoint subsystem
MAINTAINERS: Add Vignesh Raghavendra as maintainer of TI DRA7XX/J721E PCI driver
Revert "PCI: tegra: Use PCI_CONF1_EXT_ADDRESS() macro"
Linus Torvalds [Sat, 22 Oct 2022 22:30:15 +0000 (15:30 -0700)]
Merge tag 'media/v6.1-2' of git://git./linux/kernel/git/mchehab/linux-media
Pull missed media updates from Mauro Carvalho Chehab:
"It seems I screwed-up my previous pull request: it ends up that only
half of the media patches that were in linux-next got merged in -rc1.
The script which creates the signed tags silently failed due to
5.19->6.0 so it ended generating a tag with incomplete stuff.
So here are the missing parts:
- a DVB core security fix
- lots of fixes and cleanups for atomisp staging driver
- old drivers that are VB1 are being moved to staging to be
deprecated
- several driver updates - mostly for embedded systems, but there are
also some things addressing issues with some PC webcams, in the UVC
video driver"
* tag 'media/v6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (163 commits)
media: sun6i-csi: Move csi buffer definition to main header file
media: sun6i-csi: Introduce and use video helper functions
media: sun6i-csi: Add media ops with link notify callback
media: sun6i-csi: Remove controls handler from the driver
media: sun6i-csi: Register the media device after creation
media: sun6i-csi: Pass and store csi device directly in video code
media: sun6i-csi: Tidy up video code
media: sun6i-csi: Tidy up v4l2 code
media: sun6i-csi: Tidy up Kconfig
media: sun6i-csi: Use runtime pm for clocks and reset
media: sun6i-csi: Define and use variant to get module clock rate
media: sun6i-csi: Always set exclusive module clock rate
media: sun6i-csi: Tidy up platform code
media: sun6i-csi: Refactor main driver data structures
media: sun6i-csi: Define and use driver name and (reworked) description
media: cedrus: Add a Kconfig dependency on RESET_CONTROLLER
media: sun8i-rotate: Add a Kconfig dependency on RESET_CONTROLLER
media: sun8i-di: Add a Kconfig dependency on RESET_CONTROLLER
media: sun4i-csi: Add a Kconfig dependency on RESET_CONTROLLER
media: sun6i-csi: Add a Kconfig dependency on RESET_CONTROLLER
...
Pavel Begunkov [Fri, 21 Oct 2022 10:16:41 +0000 (11:16 +0100)]
io_uring/net: fail zc sendmsg when unsupported by socket
The previous patch fails zerocopy send requests for protocols that don't
support it, do the same for zerocopy sendmsg.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0854e7bb4c3d810a48ec8b5853e2f61af36a0467.1666346426.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 21 Oct 2022 10:16:40 +0000 (11:16 +0100)]
io_uring/net: fail zc send when unsupported by socket
If a protocol doesn't support zerocopy it will silently fall back to
copying. This type of behaviour has always been a source of troubles
so it's better to fail such requests instead.
Cc: <stable@vger.kernel.org> # 6.0
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2db3c7f16bb6efab4b04569cd16e6242b40c5cb3.1666346426.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 21 Oct 2022 10:16:39 +0000 (11:16 +0100)]
net: flag sockets supporting msghdr originated zerocopy
We need an efficient way in io_uring to check whether a socket supports
zerocopy with msghdr provided ubuf_info. Add a new flag into the struct
socket flags fields.
Cc: <stable@vger.kernel.org> # 6.0
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/3dafafab822b1c66308bb58a0ac738b1e3f53f74.1666346426.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Wilken Gottwalt [Sat, 8 Oct 2022 11:35:34 +0000 (11:35 +0000)]
hwmon: (corsair-psu) Add USB id of the new HX1500i psu
Also update the documentation accordingly.
Signed-off-by: Wilken Gottwalt <wilken.gottwalt@posteo.net>
Link: https://lore.kernel.org/r/Y0FghqQCHG/cX5Jz@monster.localdomain
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Paolo Bonzini [Sat, 22 Oct 2022 11:43:52 +0000 (07:43 -0400)]
tools: include: sync include/api/linux/kvm.h
Provide a definition of KVM_CAP_DIRTY_LOG_RING_ACQ_REL.
Fixes:
17601bfed909 ("KVM: Add KVM_CAP_DIRTY_LOG_RING_ACQ_REL capability and config option")
Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Alexander Graf [Mon, 17 Oct 2022 18:45:41 +0000 (20:45 +0200)]
KVM: x86: Add compat handler for KVM_X86_SET_MSR_FILTER
The KVM_X86_SET_MSR_FILTER ioctls contains a pointer in the passed in
struct which means it has a different struct size depending on whether
it gets called from 32bit or 64bit code.
This patch introduces compat code that converts from the 32bit struct to
its 64bit counterpart which then gets used going forward internally.
With this applied, 32bit QEMU can successfully set MSR bitmaps when
running on 64bit kernels.
Reported-by: Andrew Randrianasulu <randrianasulu@gmail.com>
Fixes:
1a155254ff937 ("KVM: x86: Introduce MSR filtering")
Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <
20221017184541.2658-4-graf@amazon.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Alexander Graf [Mon, 17 Oct 2022 18:45:40 +0000 (20:45 +0200)]
KVM: x86: Copy filter arg outside kvm_vm_ioctl_set_msr_filter()
In the next patch we want to introduce a second caller to
set_msr_filter() which constructs its own filter list on the stack.
Refactor the original function so it takes it as argument instead of
reading it through copy_from_user().
Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <
20221017184541.2658-3-graf@amazon.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Alexander Graf [Mon, 17 Oct 2022 18:45:39 +0000 (20:45 +0200)]
kvm: Add support for arch compat vm ioctls
We will introduce the first architecture specific compat vm ioctl in the
next patch. Add all necessary boilerplate to allow architectures to
override compat vm ioctls when necessary.
Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <
20221017184541.2658-2-graf@amazon.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Sat, 22 Oct 2022 07:33:58 +0000 (03:33 -0400)]
Merge tag 'kvm-riscv-fixes-6.1-1' of https://github.com/kvm-riscv/linux into HEAD
KVM/riscv fixes for 6.1, take #1
- Fix compilation without RISCV_ISA_ZICBOM
- Fix kvm_riscv_vcpu_timer_pending() for Sstc
Paolo Bonzini [Sat, 22 Oct 2022 07:33:26 +0000 (03:33 -0400)]
Merge tag 'kvmarm-fixes-6.1-2' of git://git./linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.1, take #2
- Fix a bug preventing restoring an ITS containing mappings
for very large and very sparse device topology
- Work around a relocation handling error when compiling
the nVHE object with profile optimisation
Paolo Bonzini [Sat, 22 Oct 2022 07:32:23 +0000 (03:32 -0400)]
Merge tag 'kvmarm-fixes-6.1-1' of git://git./linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.1, take #1
- Fix for stage-2 invalidation holding the VM MMU lock
for too long by limiting the walk to the largest
block mapping size
- Enable stack protection and branch profiling for VHE
- Two selftest fixes
Linus Torvalds [Sat, 22 Oct 2022 01:26:00 +0000 (18:26 -0700)]
Merge tag 'thermal-6.1-rc2' of git://git./linux/kernel/git/rafael/linux-pm
Pull thermal control fix from Rafael Wysocki:
"This fixes the control CPU selection in the intel_powerclamp thermal
driver"
* tag 'thermal-6.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thermal: intel_powerclamp: Use first online CPU as control_cpu
Linus Torvalds [Sat, 22 Oct 2022 01:19:42 +0000 (18:19 -0700)]
Merge tag 'pm-6.1-rc2' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix some issues and clean up code in ARM cpufreq drivers.
Specifics:
- Fix module loading in the Tegra124 cpufreq driver (Jon Hunter)
- Fix memory leak and update to read-only region in the qcom cpufreq
driver (Fabien Parent)
- Miscellaneous minor cleanups to cpufreq drivers (Fabien Parent,
Yang Yingliang)"
* tag 'pm-6.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: sun50i: Switch to use dev_err_probe() helper
cpufreq: qcom-nvmem: Switch to use dev_err_probe() helper
cpufreq: imx6q: Switch to use dev_err_probe() helper
cpufreq: dt: Switch to use dev_err_probe() helper
cpufreq: qcom: remove unused parameter in function definition
cpufreq: qcom: fix writes in read-only memory region
cpufreq: qcom: fix memory leak in error path
cpufreq: tegra194: Fix module loading
Linus Torvalds [Sat, 22 Oct 2022 01:08:30 +0000 (18:08 -0700)]
Merge tag 'acpi-6.1-rc2' of git://git./linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix issues introduced during this merge window (ACPI/PCI, device
enumeration and documentation) and some other ones found recently.
Specifics:
- Add missing device reference counting to acpi_get_pci_dev() after
changing it recently (Rafael Wysocki)
- Fix resource list walk in acpi_dma_get_range() (Robin Murphy)
- Add IRQ override quirk for LENOVO IdeaPad and extend the IRQ
override warning message (Jiri Slaby)
- Fix integer overflow in ghes_estatus_pool_init() (Ashish Kalra)
- Fix multiple error records handling in one of the ACPI extlog
driver code paths (Tony Luck)
- Prune DSDT override documentation from index after dropping it
(Bagas Sanjaya)"
* tag 'acpi-6.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: scan: Fix DMA range assignment
ACPI: PCI: Fix device reference counting in acpi_get_pci_dev()
ACPI: resource: note more about IRQ override
ACPI: resource: do IRQ override on LENOVO IdeaPad
ACPI: extlog: Handle multiple records
ACPI: APEI: Fix integer overflow in ghes_estatus_pool_init()
Documentation: ACPI: Prune DSDT override documentation from index
Linus Torvalds [Sat, 22 Oct 2022 01:02:36 +0000 (18:02 -0700)]
Merge tag 'efi-fixes-for-v6.1-1' of git://git./linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
- fixes for the EFI variable store refactor that landed in v6.0
- fixes for issues that were introduced during the merge window
- back out some changes related to EFI zboot signing - we'll add a
better solution for this during the next cycle
* tag 'efi-fixes-for-v6.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi: runtime: Don't assume virtual mappings are missing if VA == PA == 0
efi: libstub: Fix incorrect payload size in zboot header
efi: libstub: Give efi_main() asmlinkage qualification
efi: efivars: Fix variable writes without query_variable_store()
efi: ssdt: Don't free memory if ACPI table was loaded successfully
efi: libstub: Remove zboot signing from build options
Linus Torvalds [Sat, 22 Oct 2022 00:47:39 +0000 (17:47 -0700)]
Merge tag 'iommu-fixes-v6.1-rc1' of git://git./linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
"Intel VT-d fixes:
- Fix a lockdep splat issue in intel_iommu_init()
- Allow NVS regions to pass RMRR check
- Domain cleanup in error path"
* tag 'iommu-fixes-v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/vt-d: Clean up si_domain in the init_dmars() error path
iommu/vt-d: Allow NVS regions in arch_rmrr_sanity_check()
iommu/vt-d: Use rcu_lock in get_resv_regions
iommu: Add gfp parameter to iommu_alloc_resv_region
Linus Torvalds [Sat, 22 Oct 2022 00:41:57 +0000 (17:41 -0700)]
Merge tag 'for-linus-
2022102101' of git://git./linux/kernel/git/hid/hid
Pull HID fixes from Benjamin Tissoires:
- a 12 year old bug fix for the Apple Magic Trackpad v1 (José Expósito)
- a fix for a potential crash on removal of the Playstation controllers
(Roderick Colenbrander)
- a few new device IDs and device-specific quirks, most notably support
of the new Playstation DualSense Edge controller
* tag 'for-linus-
2022102101' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
HID: lenovo: Make array tp10ubkbd_led static const
HID: saitek: add madcatz variant of MMO7 mouse device ID
HID: playstation: support updated DualSense rumble mode.
HID: playstation: add initial DualSense Edge controller support
HID: playstation: stop DualSense output work on remove.
HID: magicmouse: Do not set BTN_MOUSE on double report
Linus Torvalds [Fri, 21 Oct 2022 23:01:53 +0000 (16:01 -0700)]
Merge tag '6.1-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
- memory leak fixes
- fixes for directory leases, including an important one which fixes a
problem noticed by git functional tests
- fixes relating to missing free_xid calls (helpful for
tracing/debugging of entry/exit into cifs.ko)
- a multichannel fix
- a small cleanup fix (use of list_move instead of list_del/list_add)
* tag '6.1-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module number
cifs: fix memory leaks in session setup
cifs: drop the lease for cached directories on rmdir or rename
smb3: interface count displayed incorrectly
cifs: Fix memory leak when build ntlmssp negotiate blob failed
cifs: set rc to -ENOENT if we can not get a dentry for the cached dir
cifs: use LIST_HEAD() and list_move() to simplify code
cifs: Fix xid leak in cifs_get_file_info_unix()
cifs: Fix xid leak in cifs_ses_add_channel()
cifs: Fix xid leak in cifs_flock()
cifs: Fix xid leak in cifs_copy_file_range()
cifs: Fix xid leak in cifs_create()
Linus Torvalds [Fri, 21 Oct 2022 22:51:30 +0000 (15:51 -0700)]
Merge tag 'nfsd-6.1-2' of git://git./linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
"Fixes for patches merged in v6.1"
* tag 'nfsd-6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: ensure we always call fh_verify_error tracepoint
NFSD: unregister shrinker when nfsd_init_net() fails
Chang S. Bae [Fri, 21 Oct 2022 18:58:44 +0000 (11:58 -0700)]
x86/fpu: Fix copy_xstate_to_uabi() to copy init states correctly
When an extended state component is not present in fpstate, but in init
state, the function copies from init_fpstate via copy_feature().
But, dynamic states are not present in init_fpstate because of all-zeros
init states. Then retrieving them from init_fpstate will explode like this:
BUG: kernel NULL pointer dereference, address:
0000000000000000
...
RIP: 0010:memcpy_erms+0x6/0x10
? __copy_xstate_to_uabi_buf+0x381/0x870
fpu_copy_guest_fpstate_to_uabi+0x28/0x80
kvm_arch_vcpu_ioctl+0x14c/0x1460 [kvm]
? __this_cpu_preempt_check+0x13/0x20
? vmx_vcpu_put+0x2e/0x260 [kvm_intel]
kvm_vcpu_ioctl+0xea/0x6b0 [kvm]
? kvm_vcpu_ioctl+0xea/0x6b0 [kvm]
? __fget_light+0xd4/0x130
__x64_sys_ioctl+0xe3/0x910
? debug_smp_processor_id+0x17/0x20
? fpregs_assert_state_consistent+0x27/0x50
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Adjust the 'mask' to zero out the userspace buffer for the features that
are not available both from fpstate and from init_fpstate.
The dynamic features depend on the compacted XSAVE format. Ensure it is
enabled before reading XCOMP_BV in init_fpstate.
Fixes:
2308ee57d93d ("x86/fpu/amx: Enable the AMX feature in 64-bit mode")
Reported-by: Yuan Yao <yuan.yao@intel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/lkml/BYAPR11MB3717EDEF2351C958F2C86EED95259@BYAPR11MB3717.namprd11.prod.outlook.com/
Link: https://lkml.kernel.org/r/20221021185844.13472-1-chang.seok.bae@intel.com
Linus Torvalds [Fri, 21 Oct 2022 22:19:43 +0000 (15:19 -0700)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Two small changes, one in the lpfc driver and the other in the core.
The core change is an additional footgun guard which prevents users
from writing the wrong state to sysfs and causing a hang"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: lpfc: Fix memory leak in lpfc_create_port()
scsi: core: Restrict legal sdev_state transitions via sysfs
Linus Torvalds [Fri, 21 Oct 2022 22:14:14 +0000 (15:14 -0700)]
Merge tag 'block-6.1-2022-10-20' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Christoph:
- fix nvme-hwmon for DMA non-cohehrent architectures (Serge Semin)
- add a nvme-hwmong maintainer (Christoph Hellwig)
- fix error pointer dereference in error handling (Dan Carpenter)
- fix invalid memory reference in nvmet_subsys_attr_qid_max_show
(Daniel Wagner)
- don't limit the DMA segment size in nvme-apple (Russell King)
- fix workqueue MEM_RECLAIM flushing dependency (Sagi Grimberg)
- disable write zeroes on various Kingston SSDs (Xander Li)
- fix a memory leak with block device tracing (Ye)
- flexible-array fix for ublk (Yushan)
- document the ublk recovery feature from this merge window
(ZiyangZhang)
- remove dead bfq variable in struct (Yuwei)
- error handling rq clearing fix (Yu)
- add an IRQ safety check for the cached bio freeing (Pavel)
- drbd bio cloning fix (Christoph)
* tag 'block-6.1-2022-10-20' of git://git.kernel.dk/linux:
blktrace: remove unnessary stop block trace in 'blk_trace_shutdown'
blktrace: fix possible memleak in '__blk_trace_remove'
blktrace: introduce 'blk_trace_{start,stop}' helper
bio: safeguard REQ_ALLOC_CACHE bio put
block, bfq: remove unused variable for bfq_queue
drbd: only clone bio if we have a backing device
ublk_drv: use flexible-array member instead of zero-length array
nvmet: fix invalid memory reference in nvmet_subsys_attr_qid_max_show
nvmet: fix workqueue MEM_RECLAIM flushing dependency
nvme-hwmon: kmalloc the NVME SMART log buffer
nvme-hwmon: consistently ignore errors from nvme_hwmon_init
nvme: add Guenther as nvme-hwmon maintainer
nvme-apple: don't limit DMA segement size
nvme-pci: disable write zeroes on various Kingston SSD
nvme: fix error pointer dereference in error handling
Documentation: document ublk user recovery feature
blk-mq: fix null pointer dereference in blk_mq_clear_rq_mapping()
Linus Torvalds [Fri, 21 Oct 2022 22:09:10 +0000 (15:09 -0700)]
Merge tag 'io_uring-6.1-2022-10-20' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Fix a potential memory leak in the error handling path of io-wq setup
(Rafael)
- Kill an errant debug statement that got added in this release (me)
- Fix an oops with an invalid direct descriptor with IORING_OP_MSG_RING
(Harshit)
- Remove unneeded FFS_SCM flagging (Pavel)
- Remove polling off the exit path (Pavel)
- Move out direct descriptor debug check to the cleanup path (Pavel)
- Use the proper helper rather than open-coding cached request get
(Pavel)
* tag 'io_uring-6.1-2022-10-20' of git://git.kernel.dk/linux:
io-wq: Fix memory leak in worker creation
io_uring/msg_ring: Fix NULL pointer dereference in io_msg_send_fd()
io_uring/rw: remove leftover debug statement
io_uring: don't iopoll from io_ring_ctx_wait_and_kill()
io_uring: reuse io_alloc_req()
io_uring: kill hot path fixed file bitmap debug checks
io_uring: remove FFS_SCM
Linus Torvalds [Fri, 21 Oct 2022 21:43:09 +0000 (14:43 -0700)]
Merge tag 'for-linus-6.1-rc2-tag' of git://git./linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"Just two fixes for the new 'virtio with grants' feature"
* tag 'for-linus-6.1-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/virtio: Convert PAGE_SIZE/PAGE_SHIFT/PFN_UP to Xen counterparts
xen/virtio: Handle cases when page offset > PAGE_SIZE properly
Linus Torvalds [Fri, 21 Oct 2022 21:33:36 +0000 (14:33 -0700)]
Merge tag 'selinux-pr-
20221020' of git://git./linux/kernel/git/pcmoore/selinux
Pull selinux fix from Paul Moore:
"A small SELinux fix for a GFP_KERNEL allocation while a spinlock is
held.
The patch, while still fairly small, is a bit larger than one might
expect from a simple s/GFP_KERNEL/GFP_ATOMIC/ conversion because we
added support for the function to be called with different gfp flags
depending on the context, preserving GFP_KERNEL for those cases that
can safely sleep"
* tag 'selinux-pr-
20221020' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: enable use of both GFP_KERNEL and GFP_ATOMIC in convert_context()
Linus Torvalds [Fri, 21 Oct 2022 19:33:03 +0000 (12:33 -0700)]
Merge tag 'mm-hotfixes-stable-2022-10-20' of git://git./linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morron:
"Seventeen hotfixes, mainly for MM.
Five are cc:stable and the remainder address post-6.0 issues"
* tag 'mm-hotfixes-stable-2022-10-20' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
nouveau: fix migrate_to_ram() for faulting page
mm/huge_memory: do not clobber swp_entry_t during THP split
hugetlb: fix memory leak associated with vma_lock structure
mm/page_alloc: reduce potential fragmentation in make_alloc_exact()
mm: /proc/pid/smaps_rollup: fix maple tree search
mm,hugetlb: take hugetlb_lock before decrementing h->resv_huge_pages
mm/mmap: fix MAP_FIXED address return on VMA merge
mm/mmap.c: __vma_adjust(): suppress uninitialized var warning
mm/mmap: undo ->mmap() when mas_preallocate() fails
init: Kconfig: fix spelling mistake "satify" -> "satisfy"
ocfs2: clear dinode links count in case of error
ocfs2: fix BUG when iput after ocfs2_mknod fails
gcov: support GCC 12.1 and newer compilers
zsmalloc: zs_destroy_pool: add size_class NULL check
mm/mempolicy: fix mbind_range() arguments to vma_merge()
mailmap: update email for Qais Yousef
mailmap: update Dan Carpenter's email address
Linus Torvalds [Fri, 21 Oct 2022 19:29:52 +0000 (12:29 -0700)]
Merge tag 'trace-tools-6.1-rc1' of git://git./linux/kernel/git/trace/linux-trace
Pull tracing tool update from Steven Rostedt:
- Make dot2c generate monitor's automata definition static
* tag 'trace-tools-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rv/dot2c: Make automaton definition static
Linus Torvalds [Fri, 21 Oct 2022 19:25:39 +0000 (12:25 -0700)]
Merge tag 'linux-watchdog-6.1-rc2' of git://linux-watchdog.org/linux-watchdog
Pull watchdog updates from Wim Van Sebroeck:
- Add tracing events for the most common watchdog events
* tag 'linux-watchdog-6.1-rc2' of git://www.linux-watchdog.org/linux-watchdog:
watchdog: Add tracing events for the most usual watchdog events
Rafael J. Wysocki [Fri, 21 Oct 2022 18:07:41 +0000 (20:07 +0200)]
Merge branches 'acpi-scan', 'acpi-resource', 'acpi-apei', 'acpi-extlog' and 'acpi-docs'
Merge assorted ACPI fixes for 6.1-rc2:
- Fix resource list walk in acpi_dma_get_range() (Robin Murphy).
- Add IRQ override quirk for LENOVO IdeaPad and extend the IRQ
override warning message (Jiri Slaby).
- Fix integer overflow in ghes_estatus_pool_init() (Ashish Kalra).
- Fix multiple error records handling in one of the ACPI extlog driver
code paths (Tony Luck).
- Prune DSDT override documentation from index after dropping it (Bagas
Sanjaya).
* acpi-scan:
ACPI: scan: Fix DMA range assignment
* acpi-resource:
ACPI: resource: note more about IRQ override
ACPI: resource: do IRQ override on LENOVO IdeaPad
* acpi-apei:
ACPI: APEI: Fix integer overflow in ghes_estatus_pool_init()
* acpi-extlog:
ACPI: extlog: Handle multiple records
* acpi-docs:
Documentation: ACPI: Prune DSDT override documentation from index
Chen Zhongjin [Wed, 27 Jul 2022 03:15:06 +0000 (11:15 +0800)]
x86/unwind/orc: Fix unreliable stack dump with gcov
When a console stack dump is initiated with CONFIG_GCOV_PROFILE_ALL
enabled, show_trace_log_lvl() gets out of sync with the ORC unwinder,
causing the stack trace to show all text addresses as unreliable:
# echo l > /proc/sysrq-trigger
[ 477.521031] sysrq: Show backtrace of all active CPUs
[ 477.523813] NMI backtrace for cpu 0
[ 477.524492] CPU: 0 PID: 1021 Comm: bash Not tainted 6.0.0 #65
[ 477.525295] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-1.fc36 04/01/2014
[ 477.526439] Call Trace:
[ 477.526854] <TASK>
[ 477.527216] ? dump_stack_lvl+0xc7/0x114
[ 477.527801] ? dump_stack+0x13/0x1f
[ 477.528331] ? nmi_cpu_backtrace.cold+0xb5/0x10d
[ 477.528998] ? lapic_can_unplug_cpu+0xa0/0xa0
[ 477.529641] ? nmi_trigger_cpumask_backtrace+0x16a/0x1f0
[ 477.530393] ? arch_trigger_cpumask_backtrace+0x1d/0x30
[ 477.531136] ? sysrq_handle_showallcpus+0x1b/0x30
[ 477.531818] ? __handle_sysrq.cold+0x4e/0x1ae
[ 477.532451] ? write_sysrq_trigger+0x63/0x80
[ 477.533080] ? proc_reg_write+0x92/0x110
[ 477.533663] ? vfs_write+0x174/0x530
[ 477.534265] ? handle_mm_fault+0x16f/0x500
[ 477.534940] ? ksys_write+0x7b/0x170
[ 477.535543] ? __x64_sys_write+0x1d/0x30
[ 477.536191] ? do_syscall_64+0x6b/0x100
[ 477.536809] ? entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 477.537609] </TASK>
This happens when the compiled code for show_stack() has a single word
on the stack, and doesn't use a tail call to show_stack_log_lvl().
(CONFIG_GCOV_PROFILE_ALL=y is the only known case of this.) Then the
__unwind_start() skip logic hits an off-by-one bug and fails to unwind
all the way to the intended starting frame.
Fix it by reverting the following commit:
f1d9a2abff66 ("x86/unwind/orc: Don't skip the first frame for inactive tasks")
The original justification for that commit no longer exists. That
original issue was later fixed in a different way, with the following
commit:
f2ac57a4c49d ("x86/unwind/orc: Fix inactive tasks with stack pointer in %sp on GCC 10 compiled kernels")
Fixes:
f1d9a2abff66 ("x86/unwind/orc: Don't skip the first frame for inactive tasks")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
[jpoimboe: rewrite commit log]
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Ard Biesheuvel [Thu, 20 Oct 2022 13:16:09 +0000 (15:16 +0200)]
efi: runtime: Don't assume virtual mappings are missing if VA == PA == 0
The generic EFI stub can be instructed to avoid SetVirtualAddressMap(),
and simply run with the firmware's 1:1 mapping. In this case, it
populates the virtual address fields of the runtime regions in the
memory map with the physical address of each region, so that the mapping
code has to be none the wiser. Only if SetVirtualAddressMap() fails, the
virtual addresses are wiped and the kernel code knows that the regions
cannot be mapped.
However, wiping amounts to setting it to zero, and if a runtime region
happens to live at physical address 0, its valid 1:1 mapped virtual
address could be mistaken for a wiped field, resulting on loss of access
to the EFI services at runtime.
So let's only assume that VA == 0 means 'no runtime services' if the
region in question does not live at PA 0x0.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Ard Biesheuvel [Thu, 20 Oct 2022 09:26:42 +0000 (11:26 +0200)]
efi: libstub: Fix incorrect payload size in zboot header
The linker script symbol definition that captures the size of the
compressed payload inside the zboot decompressor (which is exposed via
the image header) refers to '.' for the end of the region, which does
not give the correct result as the expression is not placed at the end
of the payload. So use the symbol name explicitly.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Ard Biesheuvel [Fri, 14 Oct 2022 17:29:57 +0000 (19:29 +0200)]
efi: libstub: Give efi_main() asmlinkage qualification
To stop the bots from sending sparse warnings to me and the list about
efi_main() not having a prototype, decorate it with asmlinkage so that
it is clear that it is called from assembly, and therefore needs to
remain external, even if it is never declared in a header file.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Ard Biesheuvel [Wed, 19 Oct 2022 21:29:58 +0000 (23:29 +0200)]
efi: efivars: Fix variable writes without query_variable_store()
Commit
bbc6d2c6ef22 ("efi: vars: Switch to new wrapper layer")
refactored the efivars layer so that the 'business logic' related to
which UEFI variables affect the boot flow in which way could be moved
out of it, and into the efivarfs driver.
This inadvertently broke setting variables on firmware implementations
that lack the QueryVariableInfo() boot service, because we no longer
tolerate a EFI_UNSUPPORTED result from check_var_size() when calling
efivar_entry_set_get_size(), which now ends up calling check_var_size()
a second time inadvertently.
If QueryVariableInfo() is missing, we support writes of up to 64k -
let's move that logic into check_var_size(), and drop the redundant
call.
Cc: <stable@vger.kernel.org> # v6.0
Fixes:
bbc6d2c6ef22 ("efi: vars: Switch to new wrapper layer")
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Ard Biesheuvel [Fri, 14 Oct 2022 10:25:52 +0000 (12:25 +0200)]
efi: ssdt: Don't free memory if ACPI table was loaded successfully
Amadeusz reports KASAN use-after-free errors introduced by commit
3881ee0b1edc ("efi: avoid efivars layer when loading SSDTs from
variables"). The problem appears to be that the memory that holds the
new ACPI table is now freed unconditionally, instead of only when the
ACPI core reported a failure to load the table.
So let's fix this, by omitting the kfree() on success.
Cc: <stable@vger.kernel.org> # v6.0
Link: https://lore.kernel.org/all/a101a10a-4fbb-5fae-2e3c-76cf96ed8fbd@linux.intel.com/
Fixes:
3881ee0b1edc ("efi: avoid efivars layer when loading SSDTs from variables")
Reported-by: Amadeusz Sławiński <amadeuszx.slawinski@linux.intel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Ard Biesheuvel [Mon, 17 Oct 2022 10:48:46 +0000 (12:48 +0200)]
efi: libstub: Remove zboot signing from build options
The zboot decompressor series introduced a feature to sign the PE/COFF
kernel image for secure boot as part of the kernel build. This was
necessary because there are actually two images that need to be signed:
the kernel with the EFI stub attached, and the decompressor application.
This is a bit of a burden, because it means that the images must be
signed on the the same system that performs the build, and this is not
realistic for distros.
During the next cycle, we will introduce changes to the zboot code so
that the inner image no longer needs to be signed. This means that the
outer PE/COFF image can be handled as usual, and be signed later in the
release process.
Let's remove the associated Kconfig options now so that they don't end
up in a LTS release while already being deprecated.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Jerry Snitselaar [Wed, 19 Oct 2022 00:44:47 +0000 (08:44 +0800)]
iommu/vt-d: Clean up si_domain in the init_dmars() error path
A splat from kmem_cache_destroy() was seen with a kernel prior to
commit
ee2653bbe89d ("iommu/vt-d: Remove domain and devinfo mempool")
when there was a failure in init_dmars(), because the iommu_domain
cache still had objects. While the mempool code is now gone, there
still is a leak of the si_domain memory if init_dmars() fails. So
clean up si_domain in the init_dmars() error path.
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Will Deacon <will@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Fixes:
86080ccc223a ("iommu/vt-d: Allocate si_domain in init_dmars()")
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
Link: https://lore.kernel.org/r/20221010144842.308890-1-jsnitsel@redhat.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Charlotte Tan [Wed, 19 Oct 2022 00:44:46 +0000 (08:44 +0800)]
iommu/vt-d: Allow NVS regions in arch_rmrr_sanity_check()
arch_rmrr_sanity_check() warns if the RMRR is not covered by an ACPI
Reserved region, but it seems like it should accept an NVS region as
well. The ACPI spec
https://uefi.org/specs/ACPI/6.5/15_System_Address_Map_Interfaces.html
uses similar wording for "Reserved" and "NVS" region types; for NVS
regions it says "This range of addresses is in use or reserved by the
system and must not be used by the operating system."
There is an old comment on this mailing list that also suggests NVS
regions should pass the arch_rmrr_sanity_check() test:
The warnings come from arch_rmrr_sanity_check() since it checks whether
the region is E820_TYPE_RESERVED. However, if the purpose of the check
is to detect RMRR has regions that may be used by OS as free memory,
isn't E820_TYPE_NVS safe, too?
This patch overlaps with another proposed patch that would add the region
type to the log since sometimes the bug reporter sees this log on the
console but doesn't know to include the kernel log:
https://lore.kernel.org/lkml/
20220611204859.234975-3-atomlin@redhat.com/
Here's an example of the "Firmware Bug" apparent false positive (wrapped
for line length):
DMAR: [Firmware Bug]: No firmware reserved region can cover this RMRR
[0x000000006f760000-0x000000006f762fff], contact BIOS vendor for
fixes
DMAR: [Firmware Bug]: Your BIOS is broken; bad RMRR
[0x000000006f760000-0x000000006f762fff]
This is the snippet from the e820 table:
BIOS-e820: [mem 0x0000000068bff000-0x000000006ebfefff] reserved
BIOS-e820: [mem 0x000000006ebff000-0x000000006f9fefff] ACPI NVS
BIOS-e820: [mem 0x000000006f9ff000-0x000000006fffefff] ACPI data
Fixes:
f036c7fa0ab6 ("iommu/vt-d: Check VT-d RMRR region in BIOS is reported as reserved")
Cc: Will Mortensen <will@extrahop.com>
Link: https://lore.kernel.org/linux-iommu/64a5843d-850d-e58c-4fc2-0a0eeeb656dc@nec.com/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216443
Signed-off-by: Charlotte Tan <charlotte@extrahop.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Link: https://lore.kernel.org/r/20220929044449.32515-1-charlotte@extrahop.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Lu Baolu [Wed, 19 Oct 2022 00:44:45 +0000 (08:44 +0800)]
iommu/vt-d: Use rcu_lock in get_resv_regions
Commit
5f64ce5411b46 ("iommu/vt-d: Duplicate iommu_resv_region objects
per device list") converted rcu_lock in get_resv_regions to
dmar_global_lock to allow sleeping in iommu_alloc_resv_region(). This
introduced possible recursive locking if get_resv_regions is called from
within a section where intel_iommu_init() already holds dmar_global_lock.
Especially, after commit
57365a04c921 ("iommu: Move bus setup to IOMMU
device registration"), below lockdep splats could always be seen.
============================================
WARNING: possible recursive locking detected
6.0.0-rc4+ #325 Tainted: G I
--------------------------------------------
swapper/0/1 is trying to acquire lock:
ffffffffa8a18c90 (dmar_global_lock){++++}-{3:3}, at:
intel_iommu_get_resv_regions+0x25/0x270
but task is already holding lock:
ffffffffa8a18c90 (dmar_global_lock){++++}-{3:3}, at:
intel_iommu_init+0x36d/0x6ea
...
Call Trace:
<TASK>
dump_stack_lvl+0x48/0x5f
__lock_acquire.cold.73+0xad/0x2bb
lock_acquire+0xc2/0x2e0
? intel_iommu_get_resv_regions+0x25/0x270
? lock_is_held_type+0x9d/0x110
down_read+0x42/0x150
? intel_iommu_get_resv_regions+0x25/0x270
intel_iommu_get_resv_regions+0x25/0x270
iommu_create_device_direct_mappings.isra.28+0x8d/0x1c0
? iommu_get_dma_cookie+0x6d/0x90
bus_iommu_probe+0x19f/0x2e0
iommu_device_register+0xd4/0x130
intel_iommu_init+0x3e1/0x6ea
? iommu_setup+0x289/0x289
? rdinit_setup+0x34/0x34
pci_iommu_init+0x12/0x3a
do_one_initcall+0x65/0x320
? rdinit_setup+0x34/0x34
? rcu_read_lock_sched_held+0x5a/0x80
kernel_init_freeable+0x28a/0x2f3
? rest_init+0x1b0/0x1b0
kernel_init+0x1a/0x130
ret_from_fork+0x1f/0x30
</TASK>
This rolls back dmar_global_lock to rcu_lock in get_resv_regions to avoid
the lockdep splat.
Fixes:
57365a04c921 ("iommu: Move bus setup to IOMMU device registration")
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20220927053109.4053662-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Lu Baolu [Wed, 19 Oct 2022 00:44:44 +0000 (08:44 +0800)]
iommu: Add gfp parameter to iommu_alloc_resv_region
Add gfp parameter to iommu_alloc_resv_region() for the callers to specify
the memory allocation behavior. Thus iommu_alloc_resv_region() could also
be available in critical contexts.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20220927053109.4053662-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Anup Patel [Fri, 21 Oct 2022 06:22:45 +0000 (11:52 +0530)]
RISC-V: KVM: Fix kvm_riscv_vcpu_timer_pending() for Sstc
The kvm_riscv_vcpu_timer_pending() checks per-VCPU next_cycles
and per-VCPU software injected VS timer interrupt. This function
returns incorrect value when Sstc is available because the per-VCPU
next_cycles are only updated by kvm_riscv_vcpu_timer_save() called
from kvm_arch_vcpu_put(). As a result, when Sstc is available the
VCPU does not block properly upon WFI traps.
To fix the above issue, we introduce kvm_riscv_vcpu_timer_sync()
which will update per-VCPU next_cycles upon every VM exit instead
of kvm_riscv_vcpu_timer_save().
Fixes:
8f5cb44b1bae ("RISC-V: KVM: Support sstc extension")
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Anup Patel <anup@brainfault.org>
Andrew Jones [Fri, 21 Oct 2022 06:22:39 +0000 (11:52 +0530)]
RISC-V: Fix compilation without RISCV_ISA_ZICBOM
riscv_cbom_block_size and riscv_init_cbom_blocksize() should always
be available and riscv_init_cbom_blocksize() should always be
invoked, even when compiling without RISCV_ISA_ZICBOM enabled. This
is because disabling RISCV_ISA_ZICBOM means "don't use zicbom
instructions in the kernel" not "pretend there isn't zicbom, even
when there is". When zicbom is available, whether the kernel enables
its use with RISCV_ISA_ZICBOM or not, KVM will offer it to guests.
Ensure we can build KVM and that the block size is initialized even
when compiling without RISCV_ISA_ZICBOM.
Fixes:
8f7e001e0325 ("RISC-V: Clean up the Zicbom block size probing")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Heiko Stuebner <heiko@sntech.de>
Tested-by: Heiko Stuebner <heiko@sntech.de>
Signed-off-by: Anup Patel <anup@brainfault.org>
Adam Borowski [Mon, 10 Oct 2022 18:33:51 +0000 (20:33 +0200)]
i2c: mlxbf: depend on ACPI; clean away ifdeffage
This fixes maybe_unused warnings/errors.
According to a comment during device tree removal, only ACPI is supported,
thus let's actually require it.
Fixes:
be18c5ede25d ("i2c: mlxbf: remove device tree support")
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
Alistair Popple [Wed, 19 Oct 2022 12:29:34 +0000 (23:29 +1100)]
nouveau: fix migrate_to_ram() for faulting page
Commit
16ce101db85d ("mm/memory.c: fix race when faulting a device private
page") changed the migrate_to_ram() callback to take a reference on the
device page to ensure it can't be freed while handling the fault.
Unfortunately the corresponding update to Nouveau to accommodate this
change was inadvertently dropped from that patch causing GPU to CPU
migration to fail so add it here.
Link: https://lkml.kernel.org/r/20221019122934.866205-1-apopple@nvidia.com
Fixes:
16ce101db85d ("mm/memory.c: fix race when faulting a device private page")
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 19 Oct 2022 13:41:56 +0000 (14:41 +0100)]
mm/huge_memory: do not clobber swp_entry_t during THP split
The following has been observed when running stressng mmap since commit
b653db77350c ("mm: Clear page->private when splitting or migrating a page")
watchdog: BUG: soft lockup - CPU#75 stuck for 26s! [stress-ng:9546]
CPU: 75 PID: 9546 Comm: stress-ng Tainted: G E 6.0.0-revert-
b653db77-fix+ #29
0357d79b60fb09775f678e4f3f64ef0579ad1374
Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
RIP: 0010:xas_descend+0x28/0x80
Code: cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0 48 83 c0 04 48 8b 44 c6 08 48 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 75 08 <48> 3d fd 00 00 00 76 08 88 57 12 c3 cc cc cc cc 48 c1 e8 02 89 c2
RSP: 0018:
ffffbbf02a2236a8 EFLAGS:
00000246
RAX:
ffff9cab7d6a0002 RBX:
ffffe04b0af88040 RCX:
0000000000000002
RDX:
0000000000000030 RSI:
ffff9cab60509b60 RDI:
ffffbbf02a2236c0
RBP:
0000000000000000 R08:
ffff9cab60509b60 R09:
ffffbbf02a2236c0
R10:
0000000000000001 R11:
ffffbbf02a223698 R12:
0000000000000000
R13:
ffff9cab4e28da80 R14:
0000000000039c01 R15:
ffff9cab4e28da88
FS:
00007fab89b85e40(0000) GS:
ffff9cea3fcc0000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
00007fab84e00000 CR3:
00000040b73a4003 CR4:
00000000003706e0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
Call Trace:
<TASK>
xas_load+0x3a/0x50
__filemap_get_folio+0x80/0x370
? put_swap_page+0x163/0x360
pagecache_get_page+0x13/0x90
__try_to_reclaim_swap+0x50/0x190
scan_swap_map_slots+0x31e/0x670
get_swap_pages+0x226/0x3c0
folio_alloc_swap+0x1cc/0x240
add_to_swap+0x14/0x70
shrink_page_list+0x968/0xbc0
reclaim_page_list+0x70/0xf0
reclaim_pages+0xdd/0x120
madvise_cold_or_pageout_pte_range+0x814/0xf30
walk_pgd_range+0x637/0xa30
__walk_page_range+0x142/0x170
walk_page_range+0x146/0x170
madvise_pageout+0xb7/0x280
? asm_common_interrupt+0x22/0x40
madvise_vma_behavior+0x3b7/0xac0
? find_vma+0x4a/0x70
? find_vma+0x64/0x70
? madvise_vma_anon_name+0x40/0x40
madvise_walk_vmas+0xa6/0x130
do_madvise+0x2f4/0x360
__x64_sys_madvise+0x26/0x30
do_syscall_64+0x5b/0x80
? do_syscall_64+0x67/0x80
? syscall_exit_to_user_mode+0x17/0x40
? do_syscall_64+0x67/0x80
? syscall_exit_to_user_mode+0x17/0x40
? do_syscall_64+0x67/0x80
? do_syscall_64+0x67/0x80
? common_interrupt+0x8b/0xa0
entry_SYSCALL_64_after_hwframe+0x63/0xcd
The problem can be reproduced with the mmtests config
config-workload-stressng-mmap. It does not always happen and when it
triggers is variable but it has happened on multiple machines.
The intent of commit
b653db77350c patch was to avoid the case where
PG_private is clear but folio->private is not-NULL. However, THP tail
pages uses page->private for "swp_entry_t if folio_test_swapcache()" as
stated in the documentation for struct folio. This patch only clobbers
page->private for tail pages if the head page was not in swapcache and
warns once if page->private had an unexpected value.
Link: https://lkml.kernel.org/r/20221019134156.zjyyn5aownakvztf@techsingularity.net
Fixes:
b653db77350c ("mm: Clear page->private when splitting or migrating a page")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Kravetz [Wed, 19 Oct 2022 20:19:57 +0000 (13:19 -0700)]
hugetlb: fix memory leak associated with vma_lock structure
The hugetlb vma_lock structure hangs off the vm_private_data pointer of
sharable hugetlb vmas. The structure is vma specific and can not be
shared between vmas. At fork and various other times, vmas are duplicated
via vm_area_dup(). When this happens, the pointer in the newly created
vma must be cleared and the structure reallocated. Two hugetlb specific
routines deal with this hugetlb_dup_vma_private and hugetlb_vm_op_open.
Both routines are called for newly created vmas. hugetlb_dup_vma_private
would always clear the pointer and hugetlb_vm_op_open would allocate the
new vms_lock structure. This did not work in the case of this calling
sequence pointed out in [1].
move_vma
copy_vma
new_vma = vm_area_dup(vma);
new_vma->vm_ops->open(new_vma); --> new_vma has its own vma lock.
is_vm_hugetlb_page(vma)
clear_vma_resv_huge_pages
hugetlb_dup_vma_private --> vma->vm_private_data is set to NULL
When clearing hugetlb_dup_vma_private we actually leak the associated
vma_lock structure.
The vma_lock structure contains a pointer to the associated vma. This
information can be used in hugetlb_dup_vma_private and hugetlb_vm_op_open
to ensure we only clear the vm_private_data of newly created (copied)
vmas. In such cases, the vma->vma_lock->vma field will not point to the
vma.
Update hugetlb_dup_vma_private and hugetlb_vm_op_open to not clear
vm_private_data if vma->vma_lock->vma == vma. Also, log a warning if
hugetlb_vm_op_open ever encounters the case where vma_lock has already
been correctly allocated for the vma.
[1] https://lore.kernel.org/linux-mm/
5154292a-4c55-28cd-0935-
82441e512fc3@huawei.com/
Link: https://lkml.kernel.org/r/20221019201957.34607-1-mike.kravetz@oracle.com
Fixes:
131a79b474e9 ("hugetlb: fix vma lock handling during split vma and range unmapping")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Tue, 31 May 2022 13:20:51 +0000 (09:20 -0400)]
mm/page_alloc: reduce potential fragmentation in make_alloc_exact()
Try to avoid using the left over split page on the next request for a page
by calling __free_pages_ok() with FPI_TO_TAIL. This increases the
potential of defragmenting memory when it's used for a short period of
time.
Link: https://lkml.kernel.org/r/20220531185626.yvlmymbxyoe5vags@revolver
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Wed, 19 Oct 2022 03:18:38 +0000 (20:18 -0700)]
mm: /proc/pid/smaps_rollup: fix maple tree search
/proc/pid/smaps_rollup showed 0 kB for everything: now find first vma.
Link: https://lkml.kernel.org/r/3011bee7-182-97a2-1083-d5f5b688e54b@google.com
Fixes:
c4c84f06285e ("fs/proc/task_mmu: stop using linked list and highest_vm_end")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rik van Riel [Tue, 18 Oct 2022 00:25:05 +0000 (20:25 -0400)]
mm,hugetlb: take hugetlb_lock before decrementing h->resv_huge_pages
The h->*_huge_pages counters are protected by the hugetlb_lock, but
alloc_huge_page has a corner case where it can decrement the counter
outside of the lock.
This could lead to a corrupted value of h->resv_huge_pages, which we have
observed on our systems.
Take the hugetlb_lock before decrementing h->resv_huge_pages to avoid a
potential race.
Link: https://lkml.kernel.org/r/20221017202505.0e6a4fcd@imladris.surriel.com
Fixes:
a88c76954804 ("mm: hugetlb: fix hugepage memory leak caused by wrong reserve count")
Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Glen McCready <gkmccready@meta.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam Howlett [Tue, 18 Oct 2022 19:17:12 +0000 (19:17 +0000)]
mm/mmap: fix MAP_FIXED address return on VMA merge
mmap should return the start address of newly mapped area when successful.
On a successful merge of a VMA, the return address was changed and thus
was violating that expectation from userspace.
This is a restoration of functionality provided by
309d08d9b3a3
(mm/mmap.c: fix mmap return value when vma is merged after call_mmap()).
For completeness of fixing MAP_FIXED, implement the comments from the
previous discussion to never update the address and fail if the address
changes. Leaving the error as a WARN_ON() to avoid crashing the kernel.
Link: https://lkml.kernel.org/r/20221018191613.4133459-1-Liam.Howlett@oracle.com
Link: https://lore.kernel.org/all/Y06yk66SKxlrwwfb@lakrids/
Link: https://lore.kernel.org/all/20201203085350.22624-1-liuzixian4@huawei.com/
Fixes:
4dd1b84140c1 ("mm/mmap: use advanced maple tree API for mmap_region()")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Cc: Liu Zixian <liuzixian4@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Tue, 18 Oct 2022 20:57:37 +0000 (13:57 -0700)]
mm/mmap.c: __vma_adjust(): suppress uninitialized var warning
The code is OK, but it fools gcc.
mm/mmap.c:802 __vma_adjust() error: uninitialized symbol 'next_next'.
Fixes:
524e00b36e8c5 ("mm: remove rb tree.")
Reported-by: kernel test robot <lkp@intel.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Kravetz [Tue, 18 Oct 2022 02:49:45 +0000 (19:49 -0700)]
mm/mmap: undo ->mmap() when mas_preallocate() fails
A memory leak in hugetlb_reserve_pages was reported in [1]. The root
cause was traced to an error path in mmap_region when mas_preallocate()
fails. In this case, the vma is freed after a successful call to
filesystem specific mmap. The hugetlbfs mmap routine may allocate data
structures pointed to by m_private_data. These need to be cleaned up by
the hugetlb vm_ops->close() routine.
The same issue was addressed by commit
deb0f6562884 ("mm/mmap: undo
->mmap() when arch_validate_flags() fails") for the arch_validate_flags()
test. Go to the same close_and_free_vma label if mas_preallocate() fails.
[1] https://lore.kernel.org/linux-mm/CAKXUXMxf7OiCwbxib7MwfR4M1b5+b3cNTU7n5NV9Zm4967=FPQ@mail.gmail.com/
Link: https://lkml.kernel.org/r/20221018024945.415036-1-mike.kravetz@oracle.com
Fixes:
d4af56c5c7c6 ("mm: start tracking VMAs with maple tree")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Colin Ian King [Fri, 7 Oct 2022 20:43:39 +0000 (21:43 +0100)]
init: Kconfig: fix spelling mistake "satify" -> "satisfy"
There is a spelling mistake in a Kconfig description. Fix it.
Link: https://lkml.kernel.org/r/20221007204339.2757753-1-colin.i.king@gmail.com
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joseph Qi [Mon, 17 Oct 2022 13:02:27 +0000 (21:02 +0800)]
ocfs2: clear dinode links count in case of error
In ocfs2_mknod(), if error occurs after dinode successfully allocated,
ocfs2 i_links_count will not be 0.
So even though we clear inode i_nlink before iput in error handling, it
still won't wipe inode since we'll refresh inode from dinode during inode
lock. So just like clear inode i_nlink, we clear ocfs2 i_links_count as
well. Also do the same change for ocfs2_symlink().
Link: https://lkml.kernel.org/r/20221017130227.234480-2-joseph.qi@linux.alibaba.com
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reported-by: Yan Wang <wangyan122@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joseph Qi [Mon, 17 Oct 2022 13:02:26 +0000 (21:02 +0800)]
ocfs2: fix BUG when iput after ocfs2_mknod fails
Commit
b1529a41f777 "ocfs2: should reclaim the inode if
'__ocfs2_mknod_locked' returns an error" tried to reclaim the claimed
inode if __ocfs2_mknod_locked() fails later. But this introduce a race,
the freed bit may be reused immediately by another thread, which will
update dinode, e.g. i_generation. Then iput this inode will lead to BUG:
inode->i_generation != le32_to_cpu(fe->i_generation)
We could make this inode as bad, but we did want to do operations like
wipe in some cases. Since the claimed inode bit can only affect that an
dinode is missing and will return back after fsck, it seems not a big
problem. So just leave it as is by revert the reclaim logic.
Link: https://lkml.kernel.org/r/20221017130227.234480-1-joseph.qi@linux.alibaba.com
Fixes:
b1529a41f777 ("ocfs2: should reclaim the inode if '__ocfs2_mknod_locked' returns an error")
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reported-by: Yan Wang <wangyan122@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Martin Liska [Thu, 13 Oct 2022 07:40:59 +0000 (09:40 +0200)]
gcov: support GCC 12.1 and newer compilers
Starting with GCC 12.1, the created .gcda format can't be read by gcov
tool. There are 2 significant changes to the .gcda file format that
need to be supported:
a) [gcov: Use system IO buffering]
(
23eb66d1d46a34cb28c4acbdf8a1deb80a7c5a05) changed that all sizes in
the format are in bytes and not in words (4B)
b) [gcov: make profile merging smarter]
(
72e0c742bd01f8e7e6dcca64042b9ad7e75979de) add a new checksum to the
file header.
Tested with GCC 7.5, 10.4, 12.2 and the current master.
Link: https://lkml.kernel.org/r/624bda92-f307-30e9-9aaa-8cc678b2dfb2@suse.cz
Signed-off-by: Martin Liska <mliska@suse.cz>
Tested-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alexey Romanov [Thu, 13 Oct 2022 11:28:25 +0000 (14:28 +0300)]
zsmalloc: zs_destroy_pool: add size_class NULL check
Inside the zs_destroy_pool() function, there can still be NULL size_class
pointers: if when the next size_class is allocated, inside
zs_create_pool() function, kzalloc will return NULL and handling the error
condition, zs_create_pool() will call zs_destroy_pool().
Link: https://lkml.kernel.org/r/20221013112825.61869-1-avromanov@sberdevices.ru
Fixes:
f24263a5a076 ("zsmalloc: remove unnecessary size_class NULL check")
Signed-off-by: Alexey Romanov <avromanov@sberdevices.ru>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam Howlett [Sat, 15 Oct 2022 02:12:33 +0000 (02:12 +0000)]
mm/mempolicy: fix mbind_range() arguments to vma_merge()
Fuzzing produced an invalid argument to vma_merge() which was caught by
the newly added verification of the number of VMAs being removed on
process exit. Analyzing the failure eventually resulted in finding an
issue with the search of a VMA that started at address 0, which caused an
underflow and thus the loss of many VMAs being tracked in the tree. Fix
the underflow by changing the search of the maple tree to use the start
address directly.
Link: https://lkml.kernel.org/r/20221015021135.2816178-1-Liam.Howlett@oracle.com
Fixes:
66850be55e8e ("mm/mempolicy: use vma iterator & maple state instead of vma linked list")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/r/202210052318.5ad10912-oliver.sang@intel.com
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qais Yousef [Fri, 14 Oct 2022 14:10:16 +0000 (15:10 +0100)]
mailmap: update email for Qais Yousef
Update my email address for old entry and add a new entry for my
contribution while working with arm to continue support that work.
Link: https://lkml.kernel.org/r/20221014141016.539625-1-qyousef@layalina.io
Signed-off-by: Qais Yousef <qyousef@layalina.io>
Acked-by: Qais Yousef <qais.yousef@arm.com>
Acked-by: Qais Yousef <qsyousef@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>