Yanfei Xu [Thu, 15 Jul 2021 04:26:49 +0000 (21:26 -0700)]
mm/page_alloc: correct return value when failing at preparing
If the array passed in is already partially populated, we should return
"nr_populated" even failing at preparing arguments stage.
Link: https://lkml.kernel.org/r/20210713152100.10381-3-mgorman@techsingularity.net
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/r/20210709102855.55058-1-yanfei.xu@windriver.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Thu, 15 Jul 2021 04:26:46 +0000 (21:26 -0700)]
mm/page_alloc: avoid page allocator recursion with pagesets.lock held
Syzbot is reporting potential deadlocks due to pagesets.lock when
PAGE_OWNER is enabled. One example from Desmond Cheong Zhi Xi is as
follows
__alloc_pages_bulk()
local_lock_irqsave(&pagesets.lock, flags) <---- outer lock here
prep_new_page():
post_alloc_hook():
set_page_owner():
__set_page_owner():
save_stack():
stack_depot_save():
alloc_pages():
alloc_page_interleave():
__alloc_pages():
get_page_from_freelist():
rm_queue():
rm_queue_pcplist():
local_lock_irqsave(&pagesets.lock, flags);
*** DEADLOCK ***
Zhang, Qiang also reported
BUG: sleeping function called from invalid context at mm/page_alloc.c:5179
in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0
.....
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:96
___might_sleep.cold+0x1f1/0x237 kernel/sched/core.c:9153
prepare_alloc_pages+0x3da/0x580 mm/page_alloc.c:5179
__alloc_pages+0x12f/0x500 mm/page_alloc.c:5375
alloc_page_interleave+0x1e/0x200 mm/mempolicy.c:2147
alloc_pages+0x238/0x2a0 mm/mempolicy.c:2270
stack_depot_save+0x39d/0x4e0 lib/stackdepot.c:303
save_stack+0x15e/0x1e0 mm/page_owner.c:120
__set_page_owner+0x50/0x290 mm/page_owner.c:181
prep_new_page mm/page_alloc.c:2445 [inline]
__alloc_pages_bulk+0x8b9/0x1870 mm/page_alloc.c:5313
alloc_pages_bulk_array_node include/linux/gfp.h:557 [inline]
vm_area_alloc_pages mm/vmalloc.c:2775 [inline]
__vmalloc_area_node mm/vmalloc.c:2845 [inline]
__vmalloc_node_range+0x39d/0x960 mm/vmalloc.c:2947
__vmalloc_node mm/vmalloc.c:2996 [inline]
vzalloc+0x67/0x80 mm/vmalloc.c:3066
There are a number of ways it could be fixed. The page owner code could
be audited to strip GFP flags that allow sleeping but it'll impair the
functionality of PAGE_OWNER if allocations fail. The bulk allocator could
add a special case to release/reacquire the lock for prep_new_page and
lookup PCP after the lock is reacquired at the cost of performance. The
pages requiring prep could be tracked using the least significant bit and
looping through the array although it is more complicated for the list
interface. The options are relatively complex and the second one still
incurs a performance penalty when PAGE_OWNER is active so this patch takes
the simple approach -- disable bulk allocation of PAGE_OWNER is active.
The caller will be forced to allocate one page at a time incurring a
performance penalty but PAGE_OWNER is already a performance penalty.
Link: https://lkml.kernel.org/r/20210708081434.GV3840@techsingularity.net
Fixes:
dbbee9d5cd83 ("mm/page_alloc: convert per-cpu list protection to local_lock")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Reported-by: "Zhang, Qiang" <Qiang.Zhang@windriver.com>
Reported-by: syzbot+127fd7828d6eeb611703@syzkaller.appspotmail.com
Tested-by: syzbot+127fd7828d6eeb611703@syzkaller.appspotmail.com
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matteo Croce [Thu, 15 Jul 2021 04:26:43 +0000 (21:26 -0700)]
Revert "mm/page_alloc: make should_fail_alloc_page() static"
This reverts commit
f7173090033c70886d925995e9dfdfb76dbb2441.
Fix an unresolved symbol error when CONFIG_DEBUG_INFO_BTF=y:
LD vmlinux
BTFIDS vmlinux
FAILED unresolved symbol should_fail_alloc_page
make: *** [Makefile:1199: vmlinux] Error 255
make: *** Deleting file 'vmlinux'
Link: https://lkml.kernel.org/r/20210708191128.153796-1-mcroce@linux.microsoft.com
Fixes:
f7173090033c ("mm/page_alloc: make should_fail_alloc_page() static")
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: John Hubbard <jhubbard@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Marco Elver [Thu, 15 Jul 2021 04:26:40 +0000 (21:26 -0700)]
kasan: fix build by including kernel.h
The <linux/kasan.h> header relies on _RET_IP_ being defined, and had been
receiving that definition via inclusion of bug.h which includes kernel.h.
However, since
f39650de687e ("kernel.h: split out panic and oops helpers")
that is no longer the case and get the following build error when building
CONFIG_KASAN_HW_TAGS on arm64:
In file included from arch/arm64/mm/kasan_init.c:10:
include/linux/kasan.h: In function 'kasan_slab_free':
include/linux/kasan.h:230:39: error: '_RET_IP_' undeclared (first use in this function)
230 | return __kasan_slab_free(s, object, _RET_IP_, init);
Fix it by including kernel.h from kasan.h.
Link: https://lkml.kernel.org/r/20210705072716.2125074-1-elver@google.com
Fixes:
f39650de687e ("kernel.h: split out panic and oops helpers")
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yee Lee [Thu, 15 Jul 2021 04:26:37 +0000 (21:26 -0700)]
kasan: add memzero init for unaligned size at DEBUG
Issue: when SLUB debug is on, hwtag kasan_unpoison() would overwrite the
redzone of object with unaligned size.
An additional memzero_explicit() path is added to replacing init by hwtag
instruction for those unaligned size at SLUB debug mode.
The penalty is acceptable since they are only enabled in debug mode, not
production builds. A block of comment is added for explanation.
Link: https://lkml.kernel.org/r/20210705103229.8505-3-yee.lee@mediatek.com
Signed-off-by: Yee Lee <yee.lee@mediatek.com>
Suggested-by: Andrey Konovalov <andreyknvl@gmail.com>
Suggested-by: Marco Elver <elver@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Nicholas Tang <nicholas.tang@mediatek.com>
Cc: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Marco Elver [Thu, 15 Jul 2021 04:26:34 +0000 (21:26 -0700)]
mm: move helper to check slub_debug_enabled
Move the helper to check slub_debug_enabled, so that we can confine the
use of #ifdef outside slub.c as well.
Link: https://lkml.kernel.org/r/20210705103229.8505-2-yee.lee@mediatek.com
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Yee Lee <yee.lee@mediatek.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Cc: Nicholas Tang <nicholas.tang@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Darrick J. Wong [Wed, 14 Jul 2021 16:03:41 +0000 (09:03 -0700)]
xfs: detect misaligned rtinherit directory extent size hints
If we encounter a directory that has been configured to pass on an
extent size hint to a new realtime file and the hint isn't an integer
multiple of the rt extent size, we should flag the hint for
administrative review because that is a misconfiguration (that other
parts of the kernel will fix automatically).
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 12 Jul 2021 19:58:49 +0000 (12:58 -0700)]
xfs: fix an integer overflow error in xfs_growfs_rt
During a realtime grow operation, we run a single transaction for each
rt bitmap block added to the filesystem. This means that each step has
to be careful to increase sb_rblocks appropriately.
Fix the integer overflow error in this calculation that can happen when
the extent size is very large. Found by running growfs to add a rt
volume to a filesystem formatted with a 1g rt extent size.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 12 Jul 2021 19:58:48 +0000 (12:58 -0700)]
xfs: improve FSGROWFSRT precondition checking
Improve the checking at the start of a realtime grow operation so that
we avoid accidentally set a new extent size that is too large and avoid
adding an rt volume to a filesystem with rmap or reflink because we
don't support rt rmap or reflink yet.
While we're at it, separate the checks so that we're only testing one
aspect at a time.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 12 Jul 2021 19:58:51 +0000 (12:58 -0700)]
xfs: don't expose misaligned extszinherit hints to userspace
Commit
603f000b15f2 changed xfs_ioctl_setattr_check_extsize to reject an
attempt to set an EXTSZINHERIT extent size hint on a directory with
RTINHERIT set if the hint isn't a multiple of the realtime extent size.
However, I have recently discovered that it is possible to change the
realtime extent size when adding a rt device to a filesystem, which
means that the existence of directories with misaligned inherited hints
is not an accident.
As a result, it's possible that someone could have set a valid hint and
added an rt volume with a different rt extent size, which invalidates
the ondisk hints. After such a sequence, FSGETXATTR will report a
misaligned hint, which FSSETXATTR will trip over, causing confusion if
the user was doing the usual GET/SET sequence to change some other
attribute. Change xfs_fill_fsxattr to omit the hint if it isn't aligned
properly.
Fixes:
603f000b15f2 ("xfs: validate extsz hints against rt extent size when rtinherit is set")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 12 Jul 2021 19:58:50 +0000 (12:58 -0700)]
xfs: correct the narrative around misaligned rtinherit/extszinherit dirs
While auditing the realtime growfs code, I realized that the GROWFSRT
ioctl (and by extension xfs_growfs) has always allowed sysadmins to
change the realtime extent size when adding a realtime section to the
filesystem. Since we also have always allowed sysadmins to set
RTINHERIT and EXTSZINHERIT on directories even if there is no realtime
device, this invalidates the premise laid out in the comments added in
commit
603f000b15f2.
In other words, this is not a case of inadequate metadata validation.
This is a case of nearly forgotten (and apparently untested) but
supported functionality. Update the comments to reflect what we've
learned, and remove the log message about correcting the misalignment.
Fixes:
603f000b15f2 ("xfs: validate extsz hints against rt extent size when rtinherit is set")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 12 Jul 2021 19:58:48 +0000 (12:58 -0700)]
xfs: reset child dir '..' entry when unlinking child
While running xfs/168, I noticed a second source of post-shrink
corruption errors causing shutdowns.
Let's say that directory B has a low inode number and is a child of
directory A, which has a high number. If B is empty but open, and
unlinked from A, B's dotdot link continues to point to A. If A is then
unlinked and the filesystem shrunk so that A is no longer a valid inode,
a subsequent AIL push of B will trip the inode verifiers because the
dotdot entry points outside of the filesystem.
To avoid this problem, reset B's dotdot entry to the root directory when
unlinking directories, since the root directory cannot be removed.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Darrick J. Wong [Mon, 12 Jul 2021 19:58:47 +0000 (12:58 -0700)]
xfs: check for sparse inode clusters that cross new EOAG when shrinking
While running xfs/168, I noticed occasional write verifier shutdowns
involving inodes at the very end of the filesystem. Existing inode
btree validation code checks that all inode clusters are fully contained
within the filesystem.
However, due to inadequate checking in the fs shrink code, it's possible
that there could be a sparse inode cluster at the end of the filesystem
where the upper inodes of the cluster are marked as holes and the
corresponding blocks are free. In this case, the last blocks in the AG
are listed in the bnobt. This enables the shrink to proceed but results
in a filesystem that trips the inode verifiers. Fix this by disallowing
the shrink.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Andreas Gruenbacher [Thu, 15 Jul 2021 16:58:06 +0000 (09:58 -0700)]
iomap: Don't create iomap_page objects in iomap_page_mkwrite_actor
Now that we create those objects in iomap_writepage_map when needed,
there's no need to pre-create them in iomap_page_mkwrite_actor anymore.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Andreas Gruenbacher [Thu, 15 Jul 2021 16:58:05 +0000 (09:58 -0700)]
iomap: Don't create iomap_page objects for inline files
In iomap_readpage_actor, don't create iop objects for inline inodes.
Otherwise, iomap_read_inline_data will set PageUptodate without setting
iop->uptodate, and iomap_page_release will eventually complain.
To prevent this kind of bug from occurring in the future, make sure the
page doesn't have private data attached in iomap_read_inline_data.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Andreas Gruenbacher [Thu, 15 Jul 2021 16:58:05 +0000 (09:58 -0700)]
iomap: Permit pages without an iop to enter writeback
Create an iop in the writeback path if one doesn't exist. This allows us
to avoid creating the iop in some cases. We'll initially do that for pages
with inline data, but it can be extended to pages which are entirely within
an extent. It also allows for an iop to be removed from pages in the
future (eg page split).
Co-developed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Thu, 15 Jul 2021 16:58:04 +0000 (09:58 -0700)]
iomap: remove the length variable in iomap_seek_hole
The length variable is rather pointless given that it can be trivially
deduced from offset and size. Also the initial calculation can lead
to KASAN warnings.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Leizhen (ThunderTown) <thunder.leizhen@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Christoph Hellwig [Thu, 15 Jul 2021 16:58:04 +0000 (09:58 -0700)]
iomap: remove the length variable in iomap_seek_data
The length variable is rather pointless given that it can be trivially
deduced from offset and size. Also the initial calculation can lead
to KASAN warnings.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Leizhen (ThunderTown) <thunder.leizhen@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Mark Rutland [Thu, 15 Jul 2021 12:30:49 +0000 (13:30 +0100)]
arm64: entry: fix KCOV suppression
We suppress KCOV for entry.o rather than entry-common.o. As entry.o is
built from entry.S, this is pointless, and permits instrumentation of
entry-common.o, which is built from entry-common.c.
Fix the Makefile to suppress KCOV for entry-common.o, as we had intended
to begin with. I've verified with objdump that this is working as
expected.
Fixes:
bf6fa2c0dda7 ("arm64: entry: don't instrument entry code with KCOV")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210715123049.9990-1-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Mark Rutland [Wed, 14 Jul 2021 17:28:01 +0000 (18:28 +0100)]
arm64: entry: add missing noinstr
We intend that all the early exception handling code is marked as
`noinstr`, but we forgot this for __el0_error_handler_common(), which is
called before we have completed entry from user mode. If it were
instrumented, we could run into problems with RCU, lockdep, etc.
Mark it as `noinstr` to prevent this.
The few other functions in entry-common.c which do not have `noinstr` are
called once we've completed entry, and are safe to instrument.
Fixes:
bb8e93a287a5 ("arm64: entry: convert SError handlers to C")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210714172801.16475-1-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Mark Rutland [Wed, 14 Jul 2021 14:38:41 +0000 (15:38 +0100)]
arm64: mte: fix restoration of GCR_EL1 from suspend
Since commit:
bad1e1c663e0a72f ("arm64: mte: switch GCR_EL1 in kernel entry and exit")
we saved/restored the user GCR_EL1 value at exception boundaries, and
update_gcr_el1_excl() is no longer used for this. However it is used to
restore the kernel's GCR_EL1 value when returning from a suspend state.
Thus, the comment is misleading (and an ISB is necessary).
When restoring the kernel's GCR value, we need an ISB to ensure this is
used by subsequent instructions. We don't necessarily get an ISB by
other means (e.g. if the kernel is built without support for pointer
authentication). As __cpu_setup() initialised GCR_EL1.Exclude to 0xffff,
until a context synchronization event, allocation tag 0 may be used
rather than the desired set of tags.
This patch drops the misleading comment, adds the missing ISB, and for
clarity folds update_gcr_el1_excl() into its only user.
Fixes:
bad1e1c663e0 ("arm64: mte: switch GCR_EL1 in kernel entry and exit")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210714143843.56537-2-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Robin Murphy [Mon, 12 Jul 2021 14:27:46 +0000 (15:27 +0100)]
arm64: Avoid premature usercopy failure
Al reminds us that the usercopy API must only return complete failure
if absolutely nothing could be copied. Currently, if userspace does
something silly like giving us an unaligned pointer to Device memory,
or a size which overruns MTE tag bounds, we may fail to honour that
requirement when faulting on a multi-byte access even though a smaller
access could have succeeded.
Add a mitigation to the fixup routines to fall back to a single-byte
copy if we faulted on a larger access before anything has been written
to the destination, to guarantee making *some* forward progress. We
needn't be too concerned about the overall performance since this should
only occur when callers are doing something a bit dodgy in the first
place. Particularly broken userspace might still be able to trick
generic_perform_write() into an infinite loop by targeting write() at
an mmap() of some read-only device register where the fault-in load
succeeds but any store synchronously aborts such that copy_to_user() is
genuinely unable to make progress, but, well, don't do that...
CC: stable@vger.kernel.org
Reported-by: Chen Huang <chenhuang5@huawei.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/dc03d5c675731a1f24a62417dba5429ad744234e.1626098433.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Christoph Hellwig [Thu, 15 Jul 2021 14:17:11 +0000 (16:17 +0200)]
xen-blkfront: sanitize the removal state machine
xen-blkfront has a weird protocol where close message from the remote
side can be delayed, and where hot removals are treated somewhat
differently from regular removals, all leading to potential NULL
pointer removals, and a del_gendisk from the block device release
method, which will deadlock. Fix this by just performing normal hot
removals even when the device is opened like all other Linux block
drivers.
Fixes:
c76f48eb5c08 ("block: take bd_mutex around delete_partitions in del_gendisk")
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20210715141711.1257293-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 15 Jul 2021 15:31:36 +0000 (09:31 -0600)]
Merge tag 'nvme-5.14-2021-07-15' of git://git.infradead.org/nvme into block-5.14
Pull NVMe fixes from Christoph:
"nvme fixes for Linux 5.14
- fix various races in nvme-pci when shutting down just after probing
(Casey Chen)
- fix a net_device leak in nvme-tcp (Prabhakar Kushwaha)"
* tag 'nvme-5.14-2021-07-15' of git://git.infradead.org/nvme:
nvme-pci: do not call nvme_dev_remove_admin from nvme_remove
nvme-pci: fix multiple races in nvme_setup_io_queues
nvme-tcp: use __dev_get_by_name instead dev_get_by_name for OPT_HOST_IFACE
Wang Qing [Tue, 6 Jul 2021 04:00:16 +0000 (12:00 +0800)]
nbd: fix order of cleaning up the queue and freeing the tagset
We must release the queue before freeing the tagset.
Fixes:
4af5f2e03013 ("nbd: use blk_mq_alloc_disk and blk_cleanup_disk")
Reported-and-tested-by: syzbot+9ca43ff47167c0ee3466@syzkaller.appspotmail.com
Signed-off-by: Wang Qing <wangqing@vivo.com>
Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210706040016.1360412-1-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Guoqing Jiang [Tue, 6 Jul 2021 01:07:34 +0000 (09:07 +0800)]
pd: fix order of cleaning up the queue and freeing the tagset
We must release the queue before freeing the tagset.
Fixes:
262d431f9000 ("pd: use blk_mq_alloc_disk and blk_cleanup_disk")
Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210706010734.1356066-1-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rob Herring [Tue, 13 Jul 2021 19:35:14 +0000 (13:35 -0600)]
dt-bindings: Move fixed string 'patternProperties' to 'properties'
There's no need for fixed strings to be under 'patternProperties', so move
them under 'properties' instead.
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Kishon Vijay Abraham I <kishon@ti.com>
Cc: Vinod Koul <vkoul@kernel.org>
Cc: Saravanan Sekar <sravanhome@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Jagan Teki <jagan@amarulasolutions.com>
Cc: Troy Kisky <troy.kisky@boundarydevices.com>
Cc: linux-hwmon@vger.kernel.org
Cc: linux-phy@lists.infradead.org
Cc: linux-spi@vger.kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
Acked-by: Mark Brown <broonie@kernel.org>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20210713193514.690894-1-robh@kernel.org
Rob Herring [Tue, 13 Jul 2021 19:34:53 +0000 (13:34 -0600)]
dt-bindings: More dropping redundant minItems/maxItems
Another round of removing redundant minItems/maxItems from new schema in
the recent merge window.
If a property has an 'items' list, then a 'minItems' or 'maxItems' with the
same size as the list is redundant and can be dropped. Note that is DT
schema specific behavior and not standard json-schema behavior. The tooling
will fixup the final schema adding any unspecified minItems/maxItems.
This condition is partially checked with the meta-schema already, but
only if both 'minItems' and 'maxItems' are equal to the 'items' length.
An improved meta-schema is pending.
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Will Deacon <will@kernel.org>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Sureshkumar Relli <naga.sureshkumar.relli@xilinx.com>
Cc: Brian Norris <computersforpeace@gmail.com>
Cc: Kamal Dasu <kdasu.kdev@gmail.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Cc: linux-clk@vger.kernel.org
Cc: iommu@lists.linux-foundation.org
Cc: linux-mtd@lists.infradead.org
Cc: linux-rtc@vger.kernel.org
Cc: linux-usb@vger.kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Link: https://lore.kernel.org/r/20210713193453.690290-1-robh@kernel.org
Vitaly Kuznetsov [Mon, 28 Jun 2021 10:44:25 +0000 (12:44 +0200)]
KVM: selftests: smm_test: Test SMM enter from L2
Two additional tests are added:
- SMM triggered from L2 does not currupt L1 host state.
- Save/restore during SMM triggered from L2 does not corrupt guest/host
state.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <
20210628104425.391276-7-vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Vitaly Kuznetsov [Mon, 28 Jun 2021 10:44:24 +0000 (12:44 +0200)]
KVM: nSVM: Restore nested control upon leaving SMM
If the VM was migrated while in SMM, no nested state was saved/restored,
and therefore svm_leave_smm has to load both save and control area
of the vmcb12. Save area is already loaded from HSAVE area,
so now load the control area as well from the vmcb12.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <
20210628104425.391276-6-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Vitaly Kuznetsov [Mon, 28 Jun 2021 10:44:23 +0000 (12:44 +0200)]
KVM: nSVM: Fix L1 state corruption upon return from SMM
VMCB split commit
4995a3685f1b ("KVM: SVM: Use a separate vmcb for the
nested L2 guest") broke return from SMM when we entered there from guest
(L2) mode. Gen2 WS2016/Hyper-V is known to do this on boot. The problem
manifests itself like this:
kvm_exit: reason EXIT_RSM rip 0x7ffbb280 info 0 0
kvm_emulate_insn: 0:
7ffbb280: 0f aa
kvm_smm_transition: vcpu 0: leaving SMM, smbase 0x7ffb3000
kvm_nested_vmrun: rip: 0x000000007ffbb280 vmcb: 0x0000000008224000
nrip: 0xffffffffffbbe119 int_ctl: 0x01020000 event_inj: 0x00000000
npt: on
kvm_nested_intercepts: cr_read: 0000 cr_write: 0010 excp:
40060002
intercepts:
fd44bfeb 0000217f 00000000
kvm_entry: vcpu 0, rip 0xffffffffffbbe119
kvm_exit: reason EXIT_NPF rip 0xffffffffffbbe119 info
200000006 1ab000
kvm_nested_vmexit: vcpu 0 reason npf rip 0xffffffffffbbe119 info1
0x0000000200000006 info2 0x00000000001ab000 intr_info 0x00000000
error_code 0x00000000
kvm_page_fault: address 1ab000 error_code 6
kvm_nested_vmexit_inject: reason EXIT_NPF info1
200000006 info2 1ab000
int_info 0 int_info_err 0
kvm_entry: vcpu 0, rip 0x7ffbb280
kvm_exit: reason EXIT_EXCP_GP rip 0x7ffbb280 info 0 0
kvm_emulate_insn: 0:
7ffbb280: 0f aa
kvm_inj_exception: #GP (0x0)
Note: return to L2 succeeded but upon first exit to L1 its RIP points to
'RSM' instruction but we're not in SMM.
The problem appears to be that VMCB01 gets irreversibly destroyed during
SMM execution. Previously, we used to have 'hsave' VMCB where regular
(pre-SMM) L1's state was saved upon nested_svm_vmexit() but now we just
switch to VMCB01 from VMCB02.
Pre-split (working) flow looked like:
- SMM is triggered during L2's execution
- L2's state is pushed to SMRAM
- nested_svm_vmexit() restores L1's state from 'hsave'
- SMM -> RSM
- enter_svm_guest_mode() switches to L2 but keeps 'hsave' intact so we have
pre-SMM (and pre L2 VMRUN) L1's state there
- L2's state is restored from SMRAM
- upon first exit L1's state is restored from L1.
This was always broken with regards to svm_get_nested_state()/
svm_set_nested_state(): 'hsave' was never a part of what's being
save and restored so migration happening during SMM triggered from L2 would
never restore L1's state correctly.
Post-split flow (broken) looks like:
- SMM is triggered during L2's execution
- L2's state is pushed to SMRAM
- nested_svm_vmexit() switches to VMCB01 from VMCB02
- SMM -> RSM
- enter_svm_guest_mode() switches from VMCB01 to VMCB02 but pre-SMM VMCB01
is already lost.
- L2's state is restored from SMRAM
- upon first exit L1's state is restored from VMCB01 but it is corrupted
(reflects the state during 'RSM' execution).
VMX doesn't have this problem because unlike VMCB, VMCS keeps both guest
and host state so when we switch back to VMCS02 L1's state is intact there.
To resolve the issue we need to save L1's state somewhere. We could've
created a third VMCB for SMM but that would require us to modify saved
state format. L1's architectural HSAVE area (pointed by MSR_VM_HSAVE_PA)
seems appropriate: L0 is free to save any (or none) of L1's state there.
Currently, KVM does 'none'.
Note, for nested state migration to succeed, both source and destination
hypervisors must have the fix. We, however, don't need to create a new
flag indicating the fact that HSAVE area is now populated as migration
during SMM triggered from L2 was always broken.
Fixes:
4995a3685f1b ("KVM: SVM: Use a separate vmcb for the nested L2 guest")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Vitaly Kuznetsov [Mon, 28 Jun 2021 10:44:22 +0000 (12:44 +0200)]
KVM: nSVM: Introduce svm_copy_vmrun_state()
Separate the code setting non-VMLOAD-VMSAVE state from
svm_set_nested_state() into its own function. This is going to be
re-used from svm_enter_smm()/svm_leave_smm().
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <
20210628104425.391276-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Vitaly Kuznetsov [Mon, 28 Jun 2021 10:44:21 +0000 (12:44 +0200)]
KVM: nSVM: Check that VM_HSAVE_PA MSR was set before VMRUN
APM states that "The address written to the VM_HSAVE_PA MSR, which holds
the address of the page used to save the host state on a VMRUN, must point
to a hypervisor-owned page. If this check fails, the WRMSR will fail with
a #GP(0) exception. Note that a value of 0 is not considered valid for the
VM_HSAVE_PA MSR and a VMRUN that is attempted while the HSAVE_PA is 0 will
fail with a #GP(0) exception."
svm_set_msr() already checks that the supplied address is valid, so only
check for '0' is missing. Add it to nested_svm_vmrun().
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <
20210628104425.391276-3-vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Vitaly Kuznetsov [Mon, 28 Jun 2021 10:44:20 +0000 (12:44 +0200)]
KVM: nSVM: Check the value written to MSR_VM_HSAVE_PA
APM states that #GP is raised upon write to MSR_VM_HSAVE_PA when
the supplied address is not page-aligned or is outside of "maximum
supported physical address for this implementation".
page_address_valid() check seems suitable. Also, forcefully page-align
the address when it's written from VMM.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <
20210628104425.391276-2-vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
[Add comment about behavior for host-provided values. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Thu, 6 May 2021 17:58:26 +0000 (10:58 -0700)]
KVM: SVM: Fix sev_pin_memory() error checks in SEV migration utilities
Use IS_ERR() instead of checking for a NULL pointer when querying for
sev_pin_memory() failures. sev_pin_memory() always returns an error code
cast to a pointer, or a valid pointer; it never returns NULL.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Steve Rutherford <srutherford@google.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Fixes:
d3d1af85e2c7 ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command")
Fixes:
15fb7de1a7f5 ("KVM: SVM: Add KVM_SEV_RECEIVE_UPDATE_DATA command")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <
20210506175826.2166383-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Thu, 6 May 2021 17:58:25 +0000 (10:58 -0700)]
KVM: SVM: Return -EFAULT if copy_to_user() for SEV mig packet header fails
Return -EFAULT if copy_to_user() fails; if accessing user memory faults,
copy_to_user() returns the number of bytes remaining, not an error code.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Steve Rutherford <srutherford@google.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Fixes:
d3d1af85e2c7 ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <
20210506175826.2166383-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Maxim Levitsky [Wed, 7 Jul 2021 12:51:00 +0000 (15:51 +0300)]
KVM: SVM: add module param to control the #SMI interception
In theory there are no side effects of not intercepting #SMI,
because then #SMI becomes transparent to the OS and the KVM.
Plus an observation on recent Zen2 CPUs reveals that these
CPUs ignore #SMI interception and never deliver #SMI VMexits.
This is also useful to test nested KVM to see that L1
handles #SMIs correctly in case when L1 doesn't intercept #SMI.
Finally the default remains the same, the SMI are intercepted
by default thus this patch doesn't have any effect unless
non default module param value is used.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <
20210707125100.677203-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Maxim Levitsky [Wed, 7 Jul 2021 12:50:59 +0000 (15:50 +0300)]
KVM: SVM: remove INIT intercept handler
Kernel never sends real INIT even to CPUs, other than on boot.
Thus INIT interception is an error which should be caught
by a check for an unknown VMexit reason.
On top of that, the current INIT VM exit handler skips
the current instruction which is wrong.
That was added in commit
5ff3a351f687 ("KVM: x86: Move trivial
instruction-based exit handlers to common code").
Fixes:
5ff3a351f687 ("KVM: x86: Move trivial instruction-based exit handlers to common code")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <
20210707125100.677203-3-mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Maxim Levitsky [Wed, 7 Jul 2021 12:50:58 +0000 (15:50 +0300)]
KVM: SVM: #SMI interception must not skip the instruction
Commit
5ff3a351f687 ("KVM: x86: Move trivial instruction-based
exit handlers to common code"), unfortunately made a mistake of
treating nop_on_interception and nop_interception in the same way.
Former does truly nothing while the latter skips the instruction.
SMI VM exit handler should do nothing.
(SMI itself is handled by the host when we do STGI)
Fixes:
5ff3a351f687 ("KVM: x86: Move trivial instruction-based exit handlers to common code")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <
20210707125100.677203-2-mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Yu Zhang [Wed, 7 Jul 2021 23:57:02 +0000 (07:57 +0800)]
KVM: VMX: Remove vmx_msr_index from vmx.h
vmx_msr_index was used to record the list of MSRs which can be lazily
restored when kvm returns to userspace. It is now reimplemented as
kvm_uret_msrs_list, a common x86 list which is only used inside x86.c.
So just remove the obsolete declaration in vmx.h.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Message-Id: <
20210707235702.31595-1-yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Lai Jiangshan [Mon, 28 Jun 2021 17:26:32 +0000 (01:26 +0800)]
KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run()
When the host is using debug registers but the guest is not using them
nor is the guest in guest-debug state, the kvm code does not reset
the host debug registers before kvm_x86->run(). Rather, it relies on
the hardware vmentry instruction to automatically reset the dr7 registers
which ensures that the host breakpoints do not affect the guest.
This however violates the non-instrumentable nature around VM entry
and exit; for example, when a host breakpoint is set on vcpu->arch.cr2,
Another issue is consistency. When the guest debug registers are active,
the host breakpoints are reset before kvm_x86->run(). But when the
guest debug registers are inactive, the host breakpoints are delayed to
be disabled. The host tracing tools may see different results depending
on what the guest is doing.
To fix the problems, we clear %db7 unconditionally before kvm_x86->run()
if the host has set any breakpoints, no matter if the guest is using
them or not.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <
20210628172632.81029-1-jiangshanlai@gmail.com>
Cc: stable@vger.kernel.org
[Only clear %db7 instead of reloading all debug registers. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Ricardo Koller [Fri, 2 Jul 2021 20:10:42 +0000 (13:10 -0700)]
KVM: selftests: Address extra memslot parameters in vm_vaddr_alloc
Commit
a75a895e6457 ("KVM: selftests: Unconditionally use memslot 0 for
vaddr allocations") removed the memslot parameters from vm_vaddr_alloc.
It addressed all callers except one under lib/aarch64/, due to a race
with commit
e3db7579ef35 ("KVM: selftests: Add exception handling
support for aarch64")
Fix the vm_vaddr_alloc call in lib/aarch64/processor.c.
Reported-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
Message-Id: <
20210702201042.4036162-1-ricarkol@google.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pavel Skripkin [Thu, 1 Jul 2021 19:55:00 +0000 (22:55 +0300)]
kvm: debugfs: fix memory leak in kvm_create_vm_debugfs
In commit
bc9e9e672df9 ("KVM: debugfs: Reuse binary stats descriptors")
loop for filling debugfs_stat_data was copy-pasted 2 times, but
in the second loop pointers are saved over pointers allocated
in the first loop. All this causes is a memory leak, fix it.
Fixes:
bc9e9e672df9 ("KVM: debugfs: Reuse binary stats descriptors")
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Reviewed-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <
20210701195500.27097-1-paskripkin@gmail.com>
Reviewed-by: Jing Zhang <jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Thierry Reding [Tue, 22 Jun 2021 11:33:27 +0000 (13:33 +0200)]
dt-bindings: net: dsa: sja1105: Fix indentation warnings
Some of the lines aren't properly indented, causing yamllint to warn
about them:
.../nxp,sja1105.yaml:70:17: [warning] wrong indentation: expected 18 but found 16 (indentation)
Use the proper indentation to fix those warnings.
Signed-off-by: Thierry Reding <treding@nvidia.com>
Fixes:
070f5b701d559ae1 ("dt-bindings: net: dsa: sja1105: add SJA1110 bindings")
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/20210622113327.3613595-1-thierry.reding@gmail.com
Signed-off-by: Rob Herring <robh@kernel.org>
Hu Haowen [Sun, 20 Jun 2021 01:04:44 +0000 (09:04 +0800)]
docs/zh_CN: add a missing space character
"LinusTorvalds" is not pretty. Replace it with "Linus Torvalds".
Signed-off-by: Hu Haowen <src.res@email.cn>
Link: https://lore.kernel.org/r/20210620010444.24813-1-src.res@email.cn
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Ingo Molnar [Thu, 1 Jul 2021 11:31:17 +0000 (13:31 +0200)]
Documentation/features: Add THREAD_INFO_IN_TASK feature matrix
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/YN2nhV5F0hBVNPuX@gmail.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Ingo Molnar [Thu, 1 Jul 2021 11:31:52 +0000 (13:31 +0200)]
Documentation/features: Update the ARCH_HAS_TICK_BROADCAST entry
Risc-V gained support recently.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/YN2nqOVHgGDt4Iid@gmail.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Nishanth Menon [Sat, 3 Jul 2021 01:29:31 +0000 (20:29 -0500)]
LICENSES/dual/CC-BY-4.0: Git rid of "smart quotes"
A couple of exotic quote characters came in with this license text; they
can confuse software that is not expecting non-ASCII text. Switch to
normal quotes here, with no changes to the actual license text.
Reported-by: Rahul T R <r-ravikumar@ti.com>
Signed-off-by: Nishanth Menon <nm@ti.com>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Thorsten Leemhuis <linux@leemhuis.info>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20210703012931.30604-1-nm@ti.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Nicholas Piggin [Mon, 12 Jul 2021 01:36:50 +0000 (11:36 +1000)]
KVM: PPC: Book3S HV P9: Fix guest TM support
The conversion to C introduced several bugs in TM handling that can
cause host crashes with TM bad thing interrupts. Mostly just simple
typos or missed logic in the conversion that got through due to my
not testing TM in the guest sufficiently.
- Early TM emulation for the softpatch interrupt should be done if fake
suspend mode is _not_ active.
- Early TM emulation wants to return immediately to the guest so as to
not doom transactions unnecessarily.
- And if exiting from the guest, the host MSR should include the TM[S]
bit if the guest was T/S, before it is treclaimed.
After this fix, all the TM selftests pass when running on a P9 processor
that implements TM with softpatch interrupt.
Fixes:
89d35b2391015 ("KVM: PPC: Book3S HV P9: Implement the rest of the P9 path in C")
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210712013650.376325-1-npiggin@gmail.com
Suravee Suthikulpanit [Wed, 14 Jul 2021 21:02:22 +0000 (04:02 +0700)]
MAINTAINERS: Add Suravee Suthikulpanit as Reviewer for AMD IOMMU (AMD-Vi)
To help review changes related to AMD IOMMU.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/1626296542-30454-1-git-send-email-suravee.suthikulpanit@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Jinzhou Su [Tue, 13 Jul 2021 01:26:11 +0000 (09:26 +0800)]
drm/amdgpu: add another Renoir DID
Add new PCI device id.
Signed-off-by: Jinzhou Su <Jinzhou.Su@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 5.11.x
James Clark [Thu, 24 Jun 2021 16:43:02 +0000 (17:43 +0100)]
perf cs-etm: Split Coresight decode by aux records
Populate the auxtrace queues using AUX records rather than whole
auxtrace buffers so that the decoder is reset between each aux record.
This is similar to the auxtrace_queues__process_index() ->
auxtrace_queues__add_indexed_event() flow where
perf_session__peek_event() is used to read AUXTRACE events out of random
positions in the file based on the auxtrace index.
But now we loop over all PERF_RECORD_AUX events instead of AUXTRACE
buffers. For each PERF_RECORD_AUX event, we find the corresponding
AUXTRACE buffer using the index, and add a fragment of that buffer to
the auxtrace queues.
No other changes to decoding were made, apart from populating the
auxtrace queues. The result of decoding is identical to before, except
in cases where decoding failed completely, due to not resetting the
decoder.
The reason for this change is because AUX records are emitted any time
tracing is disabled, for example when the process is scheduled out.
Because ETM was disabled and enabled again, the decoder also needs to be
reset to force the search for a sync packet. Otherwise there would be
fatal decoding errors.
Testing
=======
Testing was done with the following script, to diff the decoding results
between the patched and un-patched versions of perf:
#!/bin/bash
set -ex
$1 script -i $3 $4 > split.script
$2 script -i $3 $4 > default.script
diff split.script default.script | head -n 20
And it was run like this, with various itrace options depending on the
quantity of synthesised events:
compare.sh ./perf-patched ./perf-default perf-per-cpu-2-threads.data --itrace=i100000ns
No changes in output were observed in the following scenarios:
* Simple per-cpu
perf record -e cs_etm/@tmc_etr0/u top
* Per-thread, single thread
perf record -e cs_etm/@tmc_etr0/u --per-thread ./threads_C
* Per-thread multiple threads (but only one thread collected data):
perf record -e cs_etm/@tmc_etr0/u --per-thread --pid 4596,4597
* Per-thread multiple threads (both threads collected data):
perf record -e cs_etm/@tmc_etr0/u --per-thread --pid 4596,4597
* Per-cpu explicit threads:
perf record -e cs_etm/@tmc_etr0/u --pid 853,854
* System-wide (per-cpu):
perf record -e cs_etm/@tmc_etr0/u -a
* No data collected (no aux buffers)
Can happen with any command when run for a short period
* Containing truncated records
Can happen with any command
* Containing aux records with 0 size
Can happen with any command
* Snapshot mode (various files with and without buffer wrap)
perf record -e cs_etm/@tmc_etr0/u -a --snapshot
Some differences were observed in the following scenario:
* Snapshot mode (with duplicate buffers)
perf record -e cs_etm/@tmc_etr0/u -a --snapshot
Fewer samples are generated in snapshot mode if duplicate buffers
were gathered because buffers with the same offset are now only added
once. This gives different, but more correct results and no duplicate
data is decoded any more.
Signed-off-by: James Clark <james.clark@arm.com>
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Tested-by: Leo Yan <leo.yan@linaro.org>
Cc: Al Grant <al.grant@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Branislav Rankov <branislav.rankov@arm.com>
Cc: Denis Nikitin <denik@chromium.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Suzuki Poulouse <suzuki.poulose@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: coresight@lists.linaro.org
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lore.kernel.org/lkml/20210624164303.28632-2-james.clark@arm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Arnaldo Carvalho de Melo [Wed, 14 Jul 2021 17:28:02 +0000 (14:28 -0300)]
tools headers: Remove broken definition of __LITTLE_ENDIAN
The linux/kconfig.h file was copied from the kernel but the line where
with the generated/autoconf.h include from where the CONFIG_ entries
would come from was deleted, as tools/ build system don't create that
file, so we ended up always defining just __LITTLE_ENDIAN as
CONFIG_CPU_BIG_ENDIAN was nowhere to be found.
This in turn ended up breaking the build in some systems where
__LITTLE_ENDIAN was already defined, such as the androind NDK.
So just ditch that block that depends on the CONFIG_CPU_BIG_ENDIAN
define.
The kconfig.h file was copied just to get IS_ENABLED() and a
'make -C tools/all' doesn't breaks with this removal.
Fixes:
93281c4a96572a34 ("x86/insn: Add an insn_decode() API")
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lore.kernel.org/lkml/YO8hK7lqJcIWuBzx@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Linus Torvalds [Wed, 14 Jul 2021 16:24:32 +0000 (09:24 -0700)]
Merge tag 'net-5.14-rc2' of git://git./linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski.
"Including fixes from bpf and netfilter.
Current release - regressions:
- sock: fix parameter order in sock_setsockopt()
Current release - new code bugs:
- netfilter: nft_last:
- fix incorrect arithmetic when restoring last used
- honor NFTA_LAST_SET on restoration
Previous releases - regressions:
- udp: properly flush normal packet at GRO time
- sfc: ensure correct number of XDP queues; don't allow enabling the
feature if there isn't sufficient resources to Tx from any CPU
- dsa: sja1105: fix address learning getting disabled on the CPU port
- mptcp: addresses a rmem accounting issue that could keep packets in
subflow receive buffers longer than necessary, delaying MPTCP-level
ACKs
- ip_tunnel: fix mtu calculation for ETHER tunnel devices
- do not reuse skbs allocated from skbuff_fclone_cache in the napi
skb cache, we'd try to return them to the wrong slab cache
- tcp: consistently disable header prediction for mptcp
Previous releases - always broken:
- bpf: fix subprog poke descriptor tracking use-after-free
- ipv6:
- allocate enough headroom in ip6_finish_output2() in case
iptables TEE is used
- tcp: drop silly ICMPv6 packet too big messages to avoid
expensive and pointless lookups (which may serve as a DDOS
vector)
- make sure fwmark is copied in SYNACK packets
- fix 'disable_policy' for forwarded packets (align with IPv4)
- netfilter: conntrack:
- do not renew entry stuck in tcp SYN_SENT state
- do not mark RST in the reply direction coming after SYN packet
for an out-of-sync entry
- mptcp: cleanly handle error conditions with MP_JOIN and syncookies
- mptcp: fix double free when rejecting a join due to port mismatch
- validate lwtstate->data before returning from skb_tunnel_info()
- tcp: call sk_wmem_schedule before sk_mem_charge in zerocopy path
- mt76: mt7921: continue to probe driver when fw already downloaded
- bonding: fix multiple issues with offloading IPsec to (thru?) bond
- stmmac: ptp: fix issues around Qbv support and setting time back
- bcmgenet: always clear wake-up based on energy detection
Misc:
- sctp: move 198 addresses from unusable to private scope
- ptp: support virtual clocks and timestamping
- openvswitch: optimize operation for key comparison"
* tag 'net-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (158 commits)
net: dsa: properly check for the bridge_leave methods in dsa_switch_bridge_leave()
sfc: add logs explaining XDP_TX/REDIRECT is not available
sfc: ensure correct number of XDP queues
sfc: fix lack of XDP TX queues - error XDP TX failed (-22)
net: fddi: fix UAF in fza_probe
net: dsa: sja1105: fix address learning getting disabled on the CPU port
net: ocelot: fix switchdev objects synced for wrong netdev with LAG offload
net: Use nlmsg_unicast() instead of netlink_unicast()
octeontx2-pf: Fix uninitialized boolean variable pps
ipv6: allocate enough headroom in ip6_finish_output2()
net: hdlc: rename 'mod_init' & 'mod_exit' functions to be module-specific
net: bridge: multicast: fix MRD advertisement router port marking race
net: bridge: multicast: fix PIM hello router port marking race
net: phy: marvell10g: fix differentiation of 88X3310 from 88X3340
dsa: fix for_each_child.cocci warnings
virtio_net: check virtqueue_add_sgs() return value
mptcp: properly account bulk freed memory
selftests: mptcp: fix case multiple subflows limited by server
mptcp: avoid processing packet if a subflow reset
mptcp: fix syncookie process if mptcp can not_accept new subflow
...
Christian Brauner [Wed, 14 Jul 2021 13:47:50 +0000 (15:47 +0200)]
fs: add vfs_parse_fs_param_source() helper
Add a simple helper that filesystems can use in their parameter parser
to parse the "source" parameter. A few places open-coded this function
and that already caused a bug in the cgroup v1 parser that we fixed.
Let's make it harder to get this wrong by introducing a helper which
performs all necessary checks.
Link: https://syzkaller.appspot.com/bug?id=6312526aba5beae046fdae8f00399f87aab48b12
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Christian Brauner [Wed, 14 Jul 2021 13:47:49 +0000 (15:47 +0200)]
cgroup: verify that source is a string
The following sequence can be used to trigger a UAF:
int fscontext_fd = fsopen("cgroup");
int fd_null = open("/dev/null, O_RDONLY);
int fsconfig(fscontext_fd, FSCONFIG_SET_FD, "source", fd_null);
close_range(3, ~0U, 0);
The cgroup v1 specific fs parser expects a string for the "source"
parameter. However, it is perfectly legitimate to e.g. specify a file
descriptor for the "source" parameter. The fs parser doesn't know what
a filesystem allows there. So it's a bug to assume that "source" is
always of type fs_value_is_string when it can reasonably also be
fs_value_is_file.
This assumption in the cgroup code causes a UAF because struct
fs_parameter uses a union for the actual value. Access to that union is
guarded by the param->type member. Since the cgroup paramter parser
didn't check param->type but unconditionally moved param->string into
fc->source a close on the fscontext_fd would trigger a UAF during
put_fs_context() which frees fc->source thereby freeing the file stashed
in param->file causing a UAF during a close of the fd_null.
Fix this by verifying that param->type is actually a string and report
an error if not.
In follow up patches I'll add a new generic helper that can be used here
and by other filesystems instead of this error-prone copy-pasta fix.
But fixing it in here first makes backporting a it to stable a lot
easier.
Fixes:
8d2451f4994f ("cgroup1: switch to option-by-option parsing")
Reported-by: syzbot+283ce5a46486d6acdbaf@syzkaller.appspotmail.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@kernel.org>
Cc: syzkaller-bugs <syzkaller-bugs@googlegroups.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Like Xu [Mon, 28 Jun 2021 07:43:54 +0000 (15:43 +0800)]
KVM: x86/pmu: Clear anythread deprecated bit when 0xa leaf is unsupported on the SVM
The AMD platform does not support the functions Ah CPUID leaf. The returned
results for this entry should all remain zero just like the native does:
AMD host:
0x0000000a 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
(uncanny) AMD guest:
0x0000000a 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00008000
Fixes:
cadbaa039b99 ("perf/x86/intel: Make anythread filter support conditional")
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <
20210628074354.33848-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Kefeng Wang [Sat, 26 Jun 2021 07:03:04 +0000 (15:03 +0800)]
KVM: mmio: Fix use-after-free Read in kvm_vm_ioctl_unregister_coalesced_mmio
BUG: KASAN: use-after-free in kvm_vm_ioctl_unregister_coalesced_mmio+0x7c/0x1ec arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:183
Read of size 8 at addr
ffff0000c03a2500 by task syz-executor083/4269
CPU: 5 PID: 4269 Comm: syz-executor083 Not tainted 5.10.0 #7
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x2d0 arch/arm64/kernel/stacktrace.c:132
show_stack+0x28/0x34 arch/arm64/kernel/stacktrace.c:196
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x110/0x164 lib/dump_stack.c:118
print_address_description+0x78/0x5c8 mm/kasan/report.c:385
__kasan_report mm/kasan/report.c:545 [inline]
kasan_report+0x148/0x1e4 mm/kasan/report.c:562
check_memory_region_inline mm/kasan/generic.c:183 [inline]
__asan_load8+0xb4/0xbc mm/kasan/generic.c:252
kvm_vm_ioctl_unregister_coalesced_mmio+0x7c/0x1ec arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:183
kvm_vm_ioctl+0xe30/0x14c4 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:3755
vfs_ioctl fs/ioctl.c:48 [inline]
__do_sys_ioctl fs/ioctl.c:753 [inline]
__se_sys_ioctl fs/ioctl.c:739 [inline]
__arm64_sys_ioctl+0xf88/0x131c fs/ioctl.c:739
__invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:220
el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670
Allocated by task 4269:
stack_trace_save+0x80/0xb8 kernel/stacktrace.c:121
kasan_save_stack mm/kasan/common.c:48 [inline]
kasan_set_track mm/kasan/common.c:56 [inline]
__kasan_kmalloc+0xdc/0x120 mm/kasan/common.c:461
kasan_kmalloc+0xc/0x14 mm/kasan/common.c:475
kmem_cache_alloc_trace include/linux/slab.h:450 [inline]
kmalloc include/linux/slab.h:552 [inline]
kzalloc include/linux/slab.h:664 [inline]
kvm_vm_ioctl_register_coalesced_mmio+0x78/0x1cc arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:146
kvm_vm_ioctl+0x7e8/0x14c4 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:3746
vfs_ioctl fs/ioctl.c:48 [inline]
__do_sys_ioctl fs/ioctl.c:753 [inline]
__se_sys_ioctl fs/ioctl.c:739 [inline]
__arm64_sys_ioctl+0xf88/0x131c fs/ioctl.c:739
__invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:220
el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670
Freed by task 4269:
stack_trace_save+0x80/0xb8 kernel/stacktrace.c:121
kasan_save_stack mm/kasan/common.c:48 [inline]
kasan_set_track+0x38/0x6c mm/kasan/common.c:56
kasan_set_free_info+0x20/0x40 mm/kasan/generic.c:355
__kasan_slab_free+0x124/0x150 mm/kasan/common.c:422
kasan_slab_free+0x10/0x1c mm/kasan/common.c:431
slab_free_hook mm/slub.c:1544 [inline]
slab_free_freelist_hook mm/slub.c:1577 [inline]
slab_free mm/slub.c:3142 [inline]
kfree+0x104/0x38c mm/slub.c:4124
coalesced_mmio_destructor+0x94/0xa4 arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:102
kvm_iodevice_destructor include/kvm/iodev.h:61 [inline]
kvm_io_bus_unregister_dev+0x248/0x280 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:4374
kvm_vm_ioctl_unregister_coalesced_mmio+0x158/0x1ec arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:186
kvm_vm_ioctl+0xe30/0x14c4 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:3755
vfs_ioctl fs/ioctl.c:48 [inline]
__do_sys_ioctl fs/ioctl.c:753 [inline]
__se_sys_ioctl fs/ioctl.c:739 [inline]
__arm64_sys_ioctl+0xf88/0x131c fs/ioctl.c:739
__invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:220
el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670
If kvm_io_bus_unregister_dev() return -ENOMEM, we already call kvm_iodevice_destructor()
inside this function to delete 'struct kvm_coalesced_mmio_dev *dev' from list
and free the dev, but kvm_iodevice_destructor() is called again, it will lead
the above issue.
Let's check the the return value of kvm_io_bus_unregister_dev(), only call
kvm_iodevice_destructor() if the return value is 0.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Message-Id: <
20210626070304.143456-1-wangkefeng.wang@huawei.com>
Cc: stable@vger.kernel.org
Fixes:
5d3c4c79384a ("KVM: Stop looking for coalesced MMIO zones if the bus is destroyed", 2021-04-20)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Fri, 25 Jun 2021 02:03:54 +0000 (19:03 -0700)]
KVM: SVM: Revert clearing of C-bit on GPA in #NPF handler
Don't clear the C-bit in the #NPF handler, as it is a legal GPA bit for
non-SEV guests, and for SEV guests the C-bit is dropped before the GPA
hits the NPT in hardware. Clearing the bit for non-SEV guests causes KVM
to mishandle #NPFs with that collide with the host's C-bit.
Although the APM doesn't explicitly state that the C-bit is not reserved
for non-SEV, Tom Lendacky confirmed that the following snippet about the
effective reduction due to the C-bit does indeed apply only to SEV guests.
Note that because guest physical addresses are always translated
through the nested page tables, the size of the guest physical address
space is not impacted by any physical address space reduction indicated
in CPUID 8000_001F[EBX]. If the C-bit is a physical address bit however,
the guest physical address space is effectively reduced by 1 bit.
And for SEV guests, the APM clearly states that the bit is dropped before
walking the nested page tables.
If the C-bit is an address bit, this bit is masked from the guest
physical address when it is translated through the nested page tables.
Consequently, the hypervisor does not need to be aware of which pages
the guest has chosen to mark private.
Note, the bogus C-bit clearing was removed from legacy #PF handler in
commit
6d1b867d0456 ("KVM: SVM: Don't strip the C-bit from CR2 on #PF
interception").
Fixes:
0ede79e13224 ("KVM: SVM: Clear C-bit from the page fault address")
Cc: Peter Gonda <pgonda@google.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <
20210625020354.431829-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Wed, 23 Jun 2021 23:05:49 +0000 (16:05 -0700)]
KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAs
Ignore "dynamic" host adjustments to the physical address mask when
generating the masks for guest PTEs, i.e. the guest PA masks. The host
physical address space and guest physical address space are two different
beasts, e.g. even though SEV's C-bit is the same bit location for both
host and guest, disabling SME in the host (which clears shadow_me_mask)
does not affect the guest PTE->GPA "translation".
For non-SEV guests, not dropping bits is the correct behavior. Assuming
KVM and userspace correctly enumerate/configure guest MAXPHYADDR, bits
that are lost as collateral damage from memory encryption are treated as
reserved bits, i.e. KVM will never get to the point where it attempts to
generate a gfn using the affected bits. And if userspace wants to create
a bogus vCPU, then userspace gets to deal with the fallout of hardware
doing odd things with bad GPAs.
For SEV guests, not dropping the C-bit is technically wrong, but it's a
moot point because KVM can't read SEV guest's page tables in any case
since they're always encrypted. Not to mention that the current KVM code
is also broken since sme_me_mask does not have to be non-zero for SEV to
be supported by KVM. The proper fix would be to teach all of KVM to
correctly handle guest private memory, but that's a task for the future.
Fixes:
d0ec49d4de90 ("kvm/x86/svm: Support Secure Memory Encryption within KVM")
Cc: stable@vger.kernel.org
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <
20210623230552.4027702-5-seanjc@google.com>
[Use a new header instead of adding header guards to paging_tmpl.h. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Wed, 23 Jun 2021 23:05:47 +0000 (16:05 -0700)]
KVM: x86: Use kernel's x86_phys_bits to handle reduced MAXPHYADDR
Use boot_cpu_data.x86_phys_bits instead of the raw CPUID information to
enumerate the MAXPHYADDR for KVM guests when TDP is disabled (the guest
version is only relevant to NPT/TDP).
When using shadow paging, any reductions to the host's MAXPHYADDR apply
to KVM and its guests as well, i.e. using the raw CPUID info will cause
KVM to misreport the number of PA bits available to the guest.
Unconditionally zero out the "Physical Address bit reduction" entry.
For !TDP, the adjustment is already done, and for TDP enumerating the
host's reduction is wrong as the reduction does not apply to GPAs.
Fixes:
9af9b94068fb ("x86/cpu/AMD: Handle SME reduction in physical address size")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <
20210623230552.4027702-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Wed, 23 Jun 2021 23:05:46 +0000 (16:05 -0700)]
KVM: x86: Use guest MAXPHYADDR from CPUID.0x8000_0008 iff TDP is enabled
Ignore the guest MAXPHYADDR reported by CPUID.0x8000_0008 if TDP, i.e.
NPT, is disabled, and instead use the host's MAXPHYADDR. Per AMD'S APM:
Maximum guest physical address size in bits. This number applies only
to guests using nested paging. When this field is zero, refer to the
PhysAddrSize field for the maximum guest physical address size.
Fixes:
24c82e576b78 ("KVM: Sanitize cpuid")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <
20210623230552.4027702-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Fri, 25 Jun 2021 00:18:53 +0000 (17:18 -0700)]
Revert "KVM: x86: WARN and reject loading KVM if NX is supported but not enabled"
Let KVM load if EFER.NX=0 even if NX is supported, the analysis and
testing (or lack thereof) for the non-PAE host case was garbage.
If the kernel won't be using PAE paging, .Ldefault_entry in head_32.S
skips over the entire EFER sequence. Hopefully that can be changed in
the future to allow KVM to require EFER.NX, but the motivation behind
KVM's requirement isn't yet merged. Reverting and revisiting the mess
at a later date is by far the safest approach.
This reverts commit
8bbed95d2cb6e5de8a342d761a89b0a04faed7be.
Fixes:
8bbed95d2cb6 ("KVM: x86: WARN and reject loading KVM if NX is supported but not enabled")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <
20210625001853.318148-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Marc Zyngier [Thu, 1 Jul 2021 07:19:28 +0000 (08:19 +0100)]
KVM: selftests: x86: Address missing vm_install_exception_handler conversions
Commit
b78f4a59669 ("KVM: selftests: Rename vm_handle_exception")
raced with a couple of new x86 tests, missing two vm_handle_exception
to vm_install_exception_handler conversions.
Help the two broken tests to catch up with the new world.
Cc: Andrew Jones <drjones@redhat.com>
CC: Ricardo Koller <ricarkol@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Message-Id: <
20210701071928.2971053-1-maz@kernel.org>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 8 Jul 2021 17:15:57 +0000 (13:15 -0400)]
Merge tag 'kvm-s390-master-5.14-1' of git://git./linux/kernel/git/kvms390/linux into HEAD
KVM: selftests: Fixes
- provide memory model for IBM z196 and zEC12
- do not require 64GB of memory
Gustavo A. R. Silva [Mon, 12 Jul 2021 05:57:54 +0000 (00:57 -0500)]
Makefile: Enable -Wimplicit-fallthrough for Clang
With the recent fixes for fallthrough warnings, it is now possible to
enable -Wimplicit-fallthrough for Clang.
It's important to mention that since we have adopted the use of the
pseudo-keyword macro fallthrough; we also want to avoid having more
/* fall through */ comments being introduced. Notice that contrary
to GCC, Clang doesn't recognize any comments as implicit fall-through
markings when the -Wimplicit-fallthrough option is enabled. So, in
order to avoid having more comments being introduced, we have to use
the option -Wimplicit-fallthrough=5 for GCC, which similar to Clang,
will cause a warning in case a code comment is intended to be used
as a fall-through marking.
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Gustavo A. R. Silva [Wed, 14 Jul 2021 16:10:40 +0000 (11:10 -0500)]
powerpc/smp: Fix fall-through warning for Clang
Fix the following fallthrough warning:
arch/powerpc/platforms/powermac/smp.c:149:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/60ef0750.I8J+C6KAtb0xVOAa%25lkp@intel.com/
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Arnaldo Carvalho de Melo [Wed, 14 Jul 2021 16:06:38 +0000 (13:06 -0300)]
perf sched: Cast PTHREAD_STACK_MIN to int as it may turn into sysconf(__SC_THREAD_STACK_MIN_VALUE)
In fedora rawhide the PTHREAD_STACK_MIN define may end up expanded to a
sysconf() call, and that will return 'long int', breaking the build:
45 fedora:rawhide : FAIL gcc version 11.1.1
20210623 (Red Hat 11.1.1-6) (GCC)
builtin-sched.c: In function 'create_tasks':
/git/perf-5.14.0-rc1/tools/include/linux/kernel.h:43:24: error: comparison of distinct pointer types lacks a cast [-Werror]
43 | (void) (&_max1 == &_max2); \
| ^~
builtin-sched.c:673:34: note: in expansion of macro 'max'
673 | (size_t) max(16 * 1024, PTHREAD_STACK_MIN));
| ^~~
cc1: all warnings being treated as errors
$ grep __sysconf /usr/include/*/*.h
/usr/include/bits/pthread_stack_min-dynamic.h:extern long int __sysconf (int __name) __THROW;
/usr/include/bits/pthread_stack_min-dynamic.h:# define PTHREAD_STACK_MIN __sysconf (__SC_THREAD_STACK_MIN_VALUE)
/usr/include/bits/time.h:extern long int __sysconf (int);
/usr/include/bits/time.h:# define CLK_TCK ((__clock_t) __sysconf (2)) /* 2 is _SC_CLK_TCK */
$
So cast it to int to cope with that.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Gustavo A. R. Silva [Wed, 14 Jul 2021 16:05:55 +0000 (11:05 -0500)]
dmaengine: mpc512x: Fix fall-through warning for Clang
Fix the following fallthrough warning (powerpc-randconfig):
drivers/dma/mpc512x_dma.c:816:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/60ef0750.I8J+C6KAtb0xVOAa%25lkp@intel.com/
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Gustavo A. R. Silva [Wed, 14 Jul 2021 16:02:37 +0000 (11:02 -0500)]
usb: gadget: fsl_qe_udc: Fix fall-through warning for Clang
Fix the following fallthrough warning (powerpc-randconfig):
drivers/usb/gadget/udc/fsl_qe_udc.c:589:4: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/60ef0750.I8J+C6KAtb0xVOAa%25lkp@intel.com/
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Zheyu Ma [Wed, 14 Jul 2021 14:54:19 +0000 (14:54 +0000)]
drm/ttm: add a check against null pointer dereference
When calling ttm_range_man_fini(), 'man' may be uninitialized, which may
cause a null pointer dereference bug.
Fix this by checking if it is a null pointer.
This log reveals it:
[ 7.902580 ] BUG: kernel NULL pointer dereference, address:
0000000000000058
[ 7.905721 ] RIP: 0010:ttm_range_man_fini+0x40/0x160
[ 7.911826 ] Call Trace:
[ 7.911826 ] radeon_ttm_fini+0x167/0x210
[ 7.911826 ] radeon_bo_fini+0x15/0x40
[ 7.913767 ] rs400_fini+0x55/0x80
[ 7.914358 ] radeon_device_fini+0x3c/0x140
[ 7.914358 ] radeon_driver_unload_kms+0x5c/0xe0
[ 7.914358 ] radeon_driver_load_kms+0x13a/0x200
[ 7.914358 ] ? radeon_driver_unload_kms+0xe0/0xe0
[ 7.914358 ] drm_dev_register+0x1db/0x290
[ 7.914358 ] radeon_pci_probe+0x16a/0x230
[ 7.914358 ] local_pci_probe+0x4a/0xb0
Signed-off-by: Zheyu Ma <zheyuma97@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1626274459-8148-1-git-send-email-zheyuma97@gmail.com
Signed-off-by: Christian König <christian.koenig@amd.com>
Hyunchul Lee [Mon, 12 Jul 2021 10:34:02 +0000 (19:34 +0900)]
cifs: fix the out of range assignment to bit fields in parse_server_interfaces
Because the out of range assignment to bit fields
are compiler-dependant, the fields could have wrong
value.
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Ronnie Sahlberg [Tue, 13 Jul 2021 02:22:59 +0000 (12:22 +1000)]
cifs: Do not use the original cruid when following DFS links for multiuser mounts
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=213565
cruid should only be used for the initial mount and after this we should use the current
users credentials.
Ignore the original cruid mount argument when creating a new context for a multiuser mount
following a DFS link.
Fixes:
24e0a1eff9e2 ("cifs: switch to new mount api")
Cc: stable@vger.kernel.org # 5.11+
Reported-by: Xiaoli Feng <xifeng@redhat.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
Shyam Prasad N [Tue, 18 May 2021 15:05:50 +0000 (15:05 +0000)]
cifs: use the expiry output of dns_query to schedule next resolution
We recently fixed DNS resolution of the server hostname during reconnect.
However, server IP address may change, even when the old one continues
to server (although sub-optimally).
We should schedule the next DNS resolution based on the TTL of
the DNS record used for the last resolution. This way, we resolve the
server hostname again when a DNS record expires.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Cc: <stable@vger.kernel.org> # v5.11+
Signed-off-by: Steve French <stfrench@microsoft.com>
Heiko Carstens [Tue, 13 Jul 2021 09:19:07 +0000 (11:19 +0200)]
libperf: Fix build error with LIBPFM4=1
Fix build error with LIBPFM4=1:
CC util/pfm.o
util/pfm.c: In function ‘parse_libpfm_events_option’:
util/pfm.c:102:30: error: ‘struct evsel’ has no member named ‘leader’
102 | evsel->leader = grp_leader;
| ^~
Committer notes:
There is this entry in 'make -C tools/perf build-test' to test the build
with libpfm:
$ grep libpfm tools/perf/tests/make
make_with_libpfm4 := LIBPFM4=1
run += make_with_libpfm4
$
But the test machine lacked libpfm-devel, now its installed and further
cases like this shouldn't happen.
Committer testing:
Before this patch this fails, after applying it:
$ make -C tools/perf build-test
make: Entering directory '/var/home/acme/git/perf/tools/perf'
- tarpkg: ./tests/perf-targz-src-pkg .
make_static: make LDFLAGS=-static NO_PERF_READ_VDSO32=1 NO_PERF_READ_VDSOX32=1 NO_JVMTI=1 -j24 DESTDIR=/tmp/tmp.KzFSfvGRQa
<SNIP>
make_no_scripts_O: make NO_LIBPYTHON=1 NO_LIBPERL=1
make_with_libpfm4_O: make LIBPFM4=1
make_install_prefix_O: make install prefix=/tmp/krava
make_no_auxtrace_O: make NO_AUXTRACE=1
<SNIP>
$ rpm -q libpfm-devel
libpfm-devel-4.11.0-4.fc34.x86_64
$
FIXME:
This shows a need for 'build-test' to bail out when a build option is
specified that has no required library devel files installed.
Fixes:
fba7c86601e2e42d ("libperf: Move 'leader' from tools/perf to perf_evsel::leader")
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20210713091907.1555560-1-hca@linux.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Arnaldo Carvalho de Melo [Sun, 9 May 2021 12:55:30 +0000 (09:55 -0300)]
tools headers UAPI: Sync files changed by the memfd_secret new syscall
To pick the changes in this cset:
7bb7f2ac24a028b2 ("arch, mm: wire up memfd_secret system call where relevant")
That silences these perf build warnings and add support for those new
syscalls in tools such as 'perf trace'.
For instance, this is now possible:
# perf trace -v -e memfd_secret
event qualifier tracepoint filter: (common_pid != 13375 && common_pid != 3713) && (id == 447)
^C#
That is the filter expression attached to the raw_syscalls:sys_{enter,exit}
tracepoints.
$ grep memfd_secret tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
447 common memfd_secret sys_memfd_secret
$
This addresses these perf build warnings:
Warning: Kernel ABI header at 'tools/arch/arm64/include/uapi/asm/unistd.h' differs from latest version at 'arch/arm64/include/uapi/asm/unistd.h'
diff -u tools/arch/arm64/include/uapi/asm/unistd.h arch/arm64/include/uapi/asm/unistd.h
Warning: Kernel ABI header at 'tools/include/uapi/asm-generic/unistd.h' differs from latest version at 'include/uapi/asm-generic/unistd.h'
diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h
Warning: Kernel ABI header at 'tools/perf/arch/x86/entry/syscalls/syscall_64.tbl' differs from latest version at 'arch/x86/entry/syscalls/syscall_64.tbl'
diff -u tools/perf/arch/x86/entry/syscalls/syscall_64.tbl arch/x86/entry/syscalls/syscall_64.tbl
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jin Yao [Wed, 7 Jul 2021 05:56:52 +0000 (13:56 +0800)]
perf stat: Merge uncore events by default for hybrid platform
On a hybrid platform, by default 'perf stat' aggregates and reports the
event counts per PMU. For example,
# perf stat -e cycles -a true
Performance counter stats for 'system wide':
1,400,445 cpu_core/cycles/
680,881 cpu_atom/cycles/
0.
001770773 seconds time elapsed
But for uncore events that's not a suitable method. Uncore has nothing
to do with hybrid. So for uncore events, we aggregate event counts from
all PMUs and report the counts without PMUs.
Before:
# perf stat -e arb/event=0x81,umask=0x1/,arb/event=0x84,umask=0x1/ -a true
Performance counter stats for 'system wide':
2,058 uncore_arb_0/event=0x81,umask=0x1/
2,028 uncore_arb_1/event=0x81,umask=0x1/
0 uncore_arb_0/event=0x84,umask=0x1/
0 uncore_arb_1/event=0x84,umask=0x1/
0.
000614498 seconds time elapsed
After:
# perf stat -e arb/event=0x81,umask=0x1/,arb/event=0x84,umask=0x1/ -a true
Performance counter stats for 'system wide':
3,996 arb/event=0x81,umask=0x1/
0 arb/event=0x84,umask=0x1/
0.
000630046 seconds time elapsed
Of course, we also keep the '--no-merge' working for uncore events.
# perf stat -e arb/event=0x81,umask=0x1/,arb/event=0x84,umask=0x1/ --no-merge true
Performance counter stats for 'system wide':
1,952 uncore_arb_0/event=0x81,umask=0x1/
1,921 uncore_arb_1/event=0x81,umask=0x1/
0 uncore_arb_0/event=0x84,umask=0x1/
0 uncore_arb_1/event=0x84,umask=0x1/
0.
000575536 seconds time elapsed
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210707055652.962-1-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jin Yao [Thu, 8 Jul 2021 01:37:01 +0000 (09:37 +0800)]
perf tests: Fix 'Convert perf time to TSC' on core-only system
If the atom CPUs are offlined, the 'cpu_atom' is not valid.
We don't need the test case for 'cpu_atom'.
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jin Yao <yao.jin@intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20210708013701.20347-5-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jin Yao [Thu, 8 Jul 2021 01:37:00 +0000 (09:37 +0800)]
perf tests: Fix 'Roundtrip evsel->name' on core-only system
If the atom CPUs are offlined, the 'cpu_atom' is not valid.
Perf will not create two events for one hw event, so the
evsel->idx doesn't need to be divided by 2 before comparing.
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jin Yao <yao.jin@intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20210708013701.20347-4-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jin Yao [Thu, 8 Jul 2021 01:36:59 +0000 (09:36 +0800)]
perf tests: Fix 'Parse event definition strings' on core-only system
If the atom CPUs are offlined, the 'cpu_atom' is not valid.
We don't need the test case for 'cpu_atom'.
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jin Yao <yao.jin@intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20210708013701.20347-3-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jin Yao [Thu, 8 Jul 2021 01:36:58 +0000 (09:36 +0800)]
perf pmu: Skip invalid hybrid pmu
On hybrid platform, such as Alderlake, if atom CPUs are offlined,
the kernel still exports the sysfs path '/sys/devices/cpu_atom/' for
'cpu_atom' pmu but the file '/sys/devices/cpu_atom/cpus' is empty,
which indicates this is an invalid pmu.
Need to check and skip the invalid hybrid pmu.
Before:
# perf list
...
branch-instructions OR cpu_atom/branch-instructions/ [Kernel PMU event]
branch-instructions OR cpu_core/branch-instructions/ [Kernel PMU event]
branch-misses OR cpu_atom/branch-misses/ [Kernel PMU event]
branch-misses OR cpu_core/branch-misses/ [Kernel PMU event]
bus-cycles OR cpu_atom/bus-cycles/ [Kernel PMU event]
bus-cycles OR cpu_core/bus-cycles/ [Kernel PMU event]
...
The cpu_atom events are still displayed even if atom CPUs are offlined.
After:
# perf list
...
branch-instructions OR cpu_core/branch-instructions/ [Kernel PMU event]
branch-misses OR cpu_core/branch-misses/ [Kernel PMU event]
bus-cycles OR cpu_core/bus-cycles/ [Kernel PMU event]
...
Now only cpu_core events are displayed.
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jin Yao <yao.jin@intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20210708013701.20347-2-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Matthew Auld [Tue, 13 Jul 2021 13:04:31 +0000 (14:04 +0100)]
drm/i915/gtt: drop the page table optimisation
We skip filling out the pt with scratch entries if the va range covers
the entire pt, since we later have to fill it with the PTEs for the
object pages anyway. However this might leave open a small window where
the PTEs don't point to anything valid for the HW to consume.
When for example using 2M GTT pages this fill_px() showed up as being
quite significant in perf measurements, and ends up being completely
wasted since we ignore the pt and just use the pde directly.
Anyway, currently we have our PTE construction split between alloc and
insert, which is probably slightly iffy nowadays, since the alloc
doesn't actually allocate anything anymore, instead it just sets up the
page directories and points the PTEs at the scratch page. Later when we
do the insert step we re-program the PTEs again. Better might be to
squash the alloc and insert into a single step, then bringing back this
optimisation(along with some others) should be possible.
Fixes:
14826673247e ("drm/i915: Only initialize partially filled pagetables")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Jon Bloomfield <jon.bloomfield@intel.com>
Cc: Chris Wilson <chris.p.wilson@intel.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: <stable@vger.kernel.org> # v4.15+
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20210713130431.2392740-1-matthew.auld@intel.com
(cherry picked from commit
8f88ca76b3942d82e2c1cea8735ec368d89ecc15)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Benjamin Gaignard [Mon, 12 Jul 2021 10:12:32 +0000 (12:12 +0200)]
iommu/rockchip: Fix physical address decoding
Restore bits 39 to 32 at correct position.
It reverses the operation done in rk_dma_addr_dte_v2().
Fixes:
c55356c534aa ("iommu: rockchip: Add support for iommu v2")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Benjamin Gaignard <benjamin.gaignard@collabora.com>
Link: https://lore.kernel.org/r/20210712101232.318589-1-benjamin.gaignard@collabora.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Lu Baolu [Mon, 12 Jul 2021 07:17:12 +0000 (15:17 +0800)]
iommu/vt-d: Fix clearing real DMA device's scalable-mode context entries
The commit
2b0140c69637e ("iommu/vt-d: Use pci_real_dma_dev() for mapping")
fixes an issue of "sub-device is removed where the context entry is cleared
for all aliases". But this commit didn't consider the PASID entry and PASID
table in VT-d scalable mode. This fix increases the coverage of scalable
mode.
Suggested-by: Sanjay Kumar <sanjay.k.kumar@intel.com>
Fixes:
8038bdb855331 ("iommu/vt-d: Only clear real DMA device's context entries")
Fixes:
2b0140c69637e ("iommu/vt-d: Use pci_real_dma_dev() for mapping")
Cc: stable@vger.kernel.org # v5.6+
Cc: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20210712071712.3416949-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Sanjay Kumar [Mon, 12 Jul 2021 07:13:15 +0000 (15:13 +0800)]
iommu/vt-d: Global devTLB flush when present context entry changed
This fixes a bug in context cache clear operation. The code was not
following the correct invalidation flow. A global device TLB invalidation
should be added after the IOTLB invalidation. At the same time, it
uses the domain ID from the context entry. But in scalable mode, the
domain ID is in PASID table entry, not context entry.
Fixes:
7373a8cc38197 ("iommu/vt-d: Setup context and enable RID2PASID support")
Cc: stable@vger.kernel.org # v5.0+
Signed-off-by: Sanjay Kumar <sanjay.k.kumar@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20210712071315.3416543-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Marek Szyprowski [Mon, 5 Jul 2021 06:56:57 +0000 (08:56 +0200)]
iommu/qcom: Revert "iommu/arm: Cleanup resources in case of probe error path"
QCOM IOMMU driver calls bus_set_iommu() for every IOMMU device controller,
what fails for the second and latter IOMMU devices. This is intended and
must be not fatal to the driver registration process. Also the cleanup
path should take care of the runtime PM state, what is missing in the
current patch. Revert relevant changes to the QCOM IOMMU driver until
a proper fix is prepared.
This partially reverts commit
249c9dc6aa0db74a0f7908efd04acf774e19b155.
Fixes:
249c9dc6aa0d ("iommu/arm: Cleanup resources in case of probe error path")
Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210705065657.30356-1-m.szyprowski@samsung.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Gustavo A. R. Silva [Wed, 14 Jul 2021 00:19:03 +0000 (19:19 -0500)]
powerpc/powernv: Fix fall-through warning for Clang
Fix the following fallthrough warnings (powernv_defconfig and powerpc64):
drivers/char/powernv-op-panel.c:78:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Vladimir Oltean [Tue, 13 Jul 2021 09:40:21 +0000 (12:40 +0300)]
net: dsa: properly check for the bridge_leave methods in dsa_switch_bridge_leave()
This was not caught because there is no switch driver which implements
the .port_bridge_join but not .port_bridge_leave method, but it should
nonetheless be fixed, as in certain conditions (driver development) it
might lead to NULL pointer dereference.
Fixes:
f66a6a69f97a ("net: dsa: permit cross-chip bridging between all trees in the system")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gustavo A. R. Silva [Tue, 13 Jul 2021 18:41:16 +0000 (13:41 -0500)]
MIPS: Fix unreachable code issue
Fix the following warning (mips-randconfig):
arch/mips/include/asm/fpu.h:79:3: warning: fallthrough annotation in unreachable code [-Wimplicit-fallthrough]
Originally, the /* fallthrough */ comment was introduced by commit:
597ce1723e0f ("MIPS: Support for 64-bit FP with O32 binaries")
and it was wrongly replaced with fallthrough; by commit:
c9b029903466 ("MIPS: Use fallthrough for arch/mips")
As the original comment is actually useful, fix this issue by
removing unreachable fallthrough; statement and place the original
/* fallthrough */ comment back.
Fixes:
c9b029903466 ("MIPS: Use fallthrough for arch/mips")
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/60edca25.k00ut905IFBjPyt5%25lkp@intel.com/
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Gustavo A. R. Silva [Tue, 13 Jul 2021 18:38:58 +0000 (13:38 -0500)]
MIPS: Fix fall-through warnings for Clang
Fix the following fallthrough warnings:
arch/mips/mm/tlbex.c:1386:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
arch/mips/mm/tlbex.c:2173:3: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough]
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/60edca25.k00ut905IFBjPyt5%25lkp@intel.com/
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Gustavo A. R. Silva [Tue, 13 Jul 2021 19:58:18 +0000 (14:58 -0500)]
ASoC: Mediatek: MT8183: Fix fall-through warning for Clang
Fix the following fallthrough warning:
sound/soc/mediatek/mt8183/mt8183-dai-adda.c:342:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Ville Syrjälä [Wed, 30 Jun 2021 16:44:13 +0000 (19:44 +0300)]
drm/i915/gt: Fix -EDEADLK handling regression
The conversion to ww mutexes failed to address the fence code which
already returns -EDEADLK when we run out of fences. Ww mutexes on
the other hand treat -EDEADLK as an internal errno value indicating
a need to restart the operation due to a deadlock. So now when the
fence code returns -EDEADLK the higher level code erroneously
restarts everything instead of returning the error to userspace
as is expected.
To remedy this let's switch the fence code to use a different errno
value for this. -ENOBUFS seems like a semi-reasonable unique choice.
Apart from igt the only user of this I could find is sna, and even
there all we do is dump the current fence registers from debugfs
into the X server log. So no user visible functionality is affected.
If we really cared about preserving this we could of course convert
back to -EDEADLK higher up, but doesn't seem like that's worth
the hassle here.
Not quite sure which commit specifically broke this, but I'll
just attribute it to the general gem ww mutex work.
Cc: stable@vger.kernel.org
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Testcase: igt/gem_pread/exhaustion
Testcase: igt/gem_pwrite/basic-exhaustion
Testcase: igt/gem_fenced_exec_thrash/too-many-fences
Fixes:
80f0b679d6f0 ("drm/i915: Add an implementation for i915_gem_ww_ctx locking, v2.")
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210630164413.25481-1-ville.syrjala@linux.intel.com
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
(cherry picked from commit
78d2ad7eb4e1f0e9cd5d79788446b6092c21d3e0)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Gustavo A. R. Silva [Tue, 13 Jul 2021 19:50:47 +0000 (14:50 -0500)]
power: supply: Fix fall-through warnings for Clang
Fix the following fallthrough warnings:
drivers/power/supply/ab8500_fg.c:1730:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
drivers/power/supply/abx500_chargalg.c:1155:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Gustavo A. R. Silva [Tue, 13 Jul 2021 19:48:28 +0000 (14:48 -0500)]
dmaengine: ti: k3-udma: Fix fall-through warning for Clang
Fix the following fallthrough warning:
drivers/dma/ti/k3-udma.c:4951:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Gustavo A. R. Silva [Tue, 13 Jul 2021 19:43:09 +0000 (14:43 -0500)]
s390: Fix fall-through warnings for Clang
Fix the following fallthrough warnings:
drivers/s390/net/ctcm_fsms.c:1457:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
drivers/s390/net/qeth_l3_main.c:437:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
drivers/s390/char/tape_char.c:374:4: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
arch/s390/kernel/uprobes.c:129:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Gustavo A. R. Silva [Tue, 13 Jul 2021 18:27:08 +0000 (13:27 -0500)]
dmaengine: ipu: Fix fall-through warning for Clang
Fix the following fallthrough warnings (arm64-randconfig):
drivers/dma/ipu/ipu_idmac.c:621:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
rivers/dma/ipu/ipu_idmac.c:981:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/60edca25.k00ut905IFBjPyt5%25lkp@intel.com/
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Linus Torvalds [Tue, 13 Jul 2021 19:07:18 +0000 (12:07 -0700)]
Merge tag 'vboxsf-v5.14-1' of git://git./linux/kernel/git/hansg/linux
Pull vboxsf fixes from Hans de Goede:
"This adds support for the atomic_open directory-inode op to vboxsf.
Note this is not just an enhancement this also fixes an actual issue
which users are hitting, see the commit message of the "boxsf: Add
support for the atomic_open directory-inode" patch"
* tag 'vboxsf-v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hansg/linux:
vboxsf: Add support for the atomic_open directory-inode op
vboxsf: Add vboxsf_[create|release]_sf_handle() helpers
vboxsf: Make vboxsf_dir_create() return the handle for the created file
vboxsf: Honor excl flag to the dir-inode create op
Linus Torvalds [Tue, 13 Jul 2021 19:02:07 +0000 (12:02 -0700)]
Merge tag 'for-5.14-rc1-tag' of git://git./linux/kernel/git/kdave/linux
Pull btrfs zoned mode fixes from David Sterba:
- fix deadlock when allocating system chunk
- fix wrong mutex unlock on an error path
- fix extent map splitting for append operation
- update and fix message reporting unusable chunk space
- don't block when background zone reclaim runs with balance in
parallel
* tag 'for-5.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: fix wrong mutex unlock on failure to allocate log root tree
btrfs: don't block if we can't acquire the reclaim lock
btrfs: properly split extent_map for REQ_OP_ZONE_APPEND
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
btrfs: fix deadlock with concurrent chunk allocations involving system chunks
btrfs: zoned: print unusable percentage when reclaiming block groups
btrfs: zoned: fix types for u64 division in btrfs_reclaim_bgs_work
Gustavo A. R. Silva [Tue, 13 Jul 2021 18:25:06 +0000 (13:25 -0500)]
iommu/arm-smmu-v3: Fix fall-through warning for Clang
Fix the following fallthrough warning (arm64-randconfig with Clang):
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:382:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/60edca25.k00ut905IFBjPyt5%25lkp@intel.com/
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Gustavo A. R. Silva [Tue, 13 Jul 2021 18:23:21 +0000 (13:23 -0500)]
mmc: jz4740: Fix fall-through warning for Clang
Fix the following fallthrough warning (mips-randconfig with Clang):
drivers/mmc/host/jz4740_mmc.c:792:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/60edca25.k00ut905IFBjPyt5%25lkp@intel.com/
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>