Heiko Carstens [Thu, 29 Jul 2021 14:28:11 +0000 (16:28 +0200)]
kcsan: use u64 instead of cycles_t
cycles_t has a different type across architectures: unsigned int,
unsigned long, or unsigned long long. Depending on architecture this
will generate this warning:
kernel/kcsan/debugfs.c: In function ‘microbenchmark’:
./include/linux/kern_levels.h:5:25: warning: format ‘%llu’ expects argument of type ‘long long unsigned int’, but argument 3 has type ‘cycles_t’ {aka ‘long unsigned int’} [-Wformat=]
To avoid this simply change the type of cycle to u64 in microbenchmark(),
since u64 is of type unsigned long long for all architectures.
Acked-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20210729142811.1309391-1-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
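The format-string point in the kcsan commit above can be illustrated with a small userspace sketch. The u64 typedef here is a stand-in for the kernel type, and format_cycles() is a hypothetical helper, not the code in kernel/kcsan/debugfs.c.

```c
#include <stdio.h>
#include <string.h>

/* Stand-in for the kernel's u64, which the commit describes as
 * unsigned long long on all architectures. */
typedef unsigned long long u64;

/* Hypothetical helper: format a cycle count with %llu. Because u64
 * matches %llu exactly, no cast is needed and -Wformat stays quiet. */
static int format_cycles(char *buf, size_t len, u64 cycles)
{
	return snprintf(buf, len, "%llu cycles", cycles);
}
```

Had the argument been an architecture-dependent type such as unsigned long, the same format string would need a cast on some builds and not on others.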
Sven Schnelle [Wed, 28 Jul 2021 19:02:54 +0000 (21:02 +0200)]
s390: add kfence region to pagetable dumper
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Link: https://lore.kernel.org/r/20210728190254.3921642-5-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Sven Schnelle [Wed, 28 Jul 2021 19:02:53 +0000 (21:02 +0200)]
s390: add support for KFENCE
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
[hca@linux.ibm.com: simplify/rework code]
Link: https://lore.kernel.org/r/20210728190254.3921642-4-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Sven Schnelle [Wed, 28 Jul 2021 19:02:52 +0000 (21:02 +0200)]
kfence: add function to mask address bits
s390 only reports the page address during a translation fault.
To make the kfence unit tests pass, add a function that might
be implemented by architectures to mask out address bits.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20210728190254.3921642-3-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
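A minimal sketch of the address masking idea in the kfence commit above, assuming 4 KiB pages. The names mask_to_page() and fault_addr_matches() are hypothetical and only illustrate how a test could compare addresses when the architecture reports just the page address.

```c
#define SKETCH_PAGE_SIZE 4096UL				/* assumed page size */
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/* Hypothetical helper: reduce an address to its page address. */
static unsigned long mask_to_page(unsigned long addr)
{
	return addr & SKETCH_PAGE_MASK;
}

/* Compare an expected fault address with a reported one after masking
 * both, so a page-granular report still matches. */
static int fault_addr_matches(unsigned long expected, unsigned long reported)
{
	return mask_to_page(expected) == mask_to_page(reported);
}
```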
Heiko Carstens [Wed, 28 Jul 2021 19:02:51 +0000 (21:02 +0200)]
s390/mm: implement set_memory_4k()
Implement set_memory_4k() which will split any present large or huge
mapping in the given range to a 4k mapping.
Link: https://lore.kernel.org/r/20210728190254.3921642-2-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Marco Elver [Wed, 28 Jul 2021 19:57:41 +0000 (21:57 +0200)]
kfence, x86: only define helpers if !MODULE
x86's <asm/tlbflush.h> only declares non-module accessible functions
(such as flush_tlb_one_kernel) if !MODULE.
In preparation of including <asm/kfence.h> from the KFENCE test module,
only define the helpers if !MODULE to avoid breaking the build with
CONFIG_KFENCE_KUNIT_TEST=m.
Signed-off-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/YQJdarx6XSUQ1tFZ@elver.google.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
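The guard described in the kfence/x86 commit above might look roughly like the following. MODULE is assumed to be defined by the build system only when compiling a loadable module; both function names are hypothetical stand-ins.

```c
/* Stands in for a function that is not accessible from modules. */
static int sketch_core_only_call(void)
{
	return 1;
}

/* Only define the helper for built-in code; a module that includes
 * this header never sees a reference to the core-only function. */
#ifndef MODULE
static inline int sketch_protect_helper(void)
{
	return sketch_core_only_call();
}
#endif
```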
Heiko Carstens [Sun, 25 Jul 2021 13:13:12 +0000 (15:13 +0200)]
s390/delay: get rid of not needed header includes
After all the changes to delay.c there are many includes which are not
needed anymore. Get rid of them.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Heiko Carstens [Sun, 25 Jul 2021 13:07:25 +0000 (15:07 +0200)]
s390/boot: get rid of arithmetics on function pointers
sparse warning:
CHECK arch/s390/boot/startup.c
arch/s390/boot/startup.c:283:39: error: arithmetics on pointers to functions
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
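One common way to silence this class of sparse error is to do the arithmetic on an integer type instead of on the function pointer. The sketch below shows that general pattern; it is not the actual change to arch/s390/boot/startup.c, and all names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical function standing in for a kernel entry point. */
static int entry_point(void)
{
	return 42;
}

typedef int (*entry_fn)(void);

/* Apply an offset via an integer type rather than writing
 * "fn + offset", which is arithmetic on a function pointer. */
static entry_fn relocate_entry(entry_fn fn, uintptr_t offset)
{
	uintptr_t addr = (uintptr_t)fn;

	addr += offset;
	return (entry_fn)addr;
}
```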
Ilya Leoshkevich [Thu, 15 Jul 2021 11:51:02 +0000 (13:51 +0200)]
s390/headers: fix code style in module.h
struct brace should be on the same line.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Heiko Carstens [Wed, 21 Jul 2021 12:03:33 +0000 (14:03 +0200)]
s390/hwcaps: make sie capability regular hwcap
Commit 7f16d7e787b7 ("s390: show virtualization support in /proc/cpuinfo")
introduced special handling for sie capability, saying this should not be
exposed via hwcaps, without giving a reason.
However this leads to an inconsistent /proc/cpuinfo features line
where all features except the sie capability are also present in
hwcaps. I really don't see a reason to not add that to hwcaps - it
might be quite pointless, but at least this way it is possible to get
rid of some special handling.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Heiko Carstens [Wed, 21 Jul 2021 19:05:35 +0000 (21:05 +0200)]
s390/hwcaps: remove hwcap stfle check
Remove the not so obvious "(elf_hwcap & (1UL << 2)" which only checks
if stfle is available. This used to be required for old code before
test_facility() was introduced. test_facility() will do the right
thing, regardless if stfle is available or not.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Heiko Carstens [Wed, 21 Jul 2021 19:03:44 +0000 (21:03 +0200)]
s390/hwcaps: remove z/Architecture mode active check
Remove a leftover from the common 31/64 bit code. z/Architecture mode
is now always active, there is no need to check.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Heiko Carstens [Wed, 21 Jul 2021 19:03:04 +0000 (21:03 +0200)]
s390/hwcaps: use consistent coding style / remove comments
Use a consistent coding style within setup_hwcaps() and remove obvious
and outdated comments.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Heiko Carstens [Wed, 21 Jul 2021 18:58:03 +0000 (20:58 +0200)]
s390/hwcaps: open code initialization of first six hwcap bits
The first six hwcap bits are initialized in a rather odd way: an array
contains the stfl(e) bits which need to be set, so that the
corresponding bit position (= array index) within hwcaps are set.
Better open code it like it is done for all other bits, making it
obvious which bit is set when.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
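The difference between the table-driven and the open-coded style from the commit above can be sketched as follows. The facility numbers are made up, and sketch_facility() is a hypothetical stand-in for the kernel's test_facility().

```c
#include <stdbool.h>

/* Hypothetical facility lookup. */
static bool sketch_facility(const bool *facilities, unsigned int nr)
{
	return facilities[nr];
}

/* Table-driven: array index = hwcap bit, value = facility number. */
static unsigned long hwcaps_from_table(const bool *facilities)
{
	static const unsigned int table[] = { 3, 7, 11 };	/* made-up numbers */
	unsigned long hwcap = 0;
	unsigned int i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++)
		if (sketch_facility(facilities, table[i]))
			hwcap |= 1UL << i;
	return hwcap;
}

/* Open-coded: each bit is set by its own visible statement. */
static unsigned long hwcaps_open_coded(const bool *facilities)
{
	unsigned long hwcap = 0;

	if (sketch_facility(facilities, 3))
		hwcap |= 1UL << 0;
	if (sketch_facility(facilities, 7))
		hwcap |= 1UL << 1;
	if (sketch_facility(facilities, 11))
		hwcap |= 1UL << 2;
	return hwcap;
}
```

Both variants compute the same mask; the second one just makes the bit-to-facility mapping readable at a glance.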
Heiko Carstens [Wed, 21 Jul 2021 11:25:21 +0000 (13:25 +0200)]
s390/hwcaps: split setup_hwcaps()
setup_hwcaps() is a quite large function. Make it smaller by moving
the elf platform setup code into an independent setup function.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Heiko Carstens [Wed, 21 Jul 2021 11:23:07 +0000 (13:23 +0200)]
s390/hwcaps: move setup_hwcaps()
Move setup_hwcaps() to processor.c for two reasons:
- make setup.c a bit smaller
- have almost all of the hwcap code in one file
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Heiko Carstens [Wed, 21 Jul 2021 09:39:25 +0000 (11:39 +0200)]
s390/hwcaps: add sanity checks
Add BUILD_BUG_ON() sanity checks to make sure the hwcap string array
contains a string for each hwcap.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
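A userspace sketch of the kind of build-time check described above, using C11 static_assert in place of BUILD_BUG_ON() and designated initializers for the string array. The capability names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capability bit numbers; MAX must stay last. */
enum {
	SKETCH_CAP_NR_ALPHA,
	SKETCH_CAP_NR_BETA,
	SKETCH_CAP_NR_GAMMA,
	SKETCH_CAP_NR_MAX
};

/* Designated initializers tie each string to its bit number. */
static const char *const sketch_cap_str[] = {
	[SKETCH_CAP_NR_ALPHA] = "alpha",
	[SKETCH_CAP_NR_BETA]  = "beta",
	[SKETCH_CAP_NR_GAMMA] = "gamma",
};

/* Fails the build if the array length and the bit count disagree. */
static_assert(sizeof(sketch_cap_str) / sizeof(sketch_cap_str[0]) ==
	      (size_t)SKETCH_CAP_NR_MAX, "one string per capability bit");

static const char *sketch_cap_name(unsigned int nr)
{
	return nr < (unsigned int)SKETCH_CAP_NR_MAX ? sketch_cap_str[nr] : NULL;
}
```

A length check like this catches a missing entry at the end of the array; a hole in the middle would still need a separate check.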
Heiko Carstens [Wed, 21 Jul 2021 09:35:19 +0000 (11:35 +0200)]
s390/hwcaps: use named initializers for hwcap string arrays
Use named initializers to make it obvious which hwcap string array
element belongs to which hwcap.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Heiko Carstens [Wed, 21 Jul 2021 09:12:36 +0000 (11:12 +0200)]
s390/hwcaps: introduce HWCAP bit numbers
Introduce HWCAP bit numbers, making it easier to tell at which bit
number we currently are. Also use these bits with the BIT macro to
define the real HWCAP masks.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
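The bit-number scheme from the commit above, sketched with hypothetical names. SKETCH_BIT() stands in for the kernel's BIT() macro.

```c
#define SKETCH_BIT(nr)	(1UL << (nr))	/* stand-in for BIT() */

/* Bit numbers make the position of each capability explicit. */
enum {
	SKETCH_HWCAP_NR_FOO = 0,
	SKETCH_HWCAP_NR_BAR = 1,
	SKETCH_HWCAP_NR_BAZ = 2,
};

/* The masks are derived from the bit numbers, never written by hand. */
#define SKETCH_HWCAP_FOO	SKETCH_BIT(SKETCH_HWCAP_NR_FOO)
#define SKETCH_HWCAP_BAR	SKETCH_BIT(SKETCH_HWCAP_NR_BAR)
#define SKETCH_HWCAP_BAZ	SKETCH_BIT(SKETCH_HWCAP_NR_BAZ)

static int sketch_has_cap(unsigned long hwcap, unsigned int nr)
{
	return (hwcap & SKETCH_BIT(nr)) != 0;
}
```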
Heiko Carstens [Wed, 21 Jul 2021 08:56:57 +0000 (10:56 +0200)]
s390/hwcaps: shorten HWCAP defines
Remove s390 part of all HWCAP defines, just to make them shorter and
easier to handle. The namespace is anyway per architecture.
This is similar to what arm64 has.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Niklas Schnelle [Thu, 8 Jul 2021 13:43:13 +0000 (15:43 +0200)]
s390: add HWCAP_S390_PCI_MIO to ELF hwcaps
In order to support the use of enhanced PCI instructions in both kernel-
and userspace we need both hardware support and proper setup in the
kernel. The latter can be toggled off with the pci=nomio command line
option.
Thus availability of this feature in userspace depends on all of kernel
configuration (CONFIG_PCI), hardware support and the current kernel
command line and can thus not rely solely on a facility bit. Instead
let's introduce a new ELF hardware capability bit HWCAP_S390_PCI_MIO to
tell userspace whether these PCI instructions can be used.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Niklas Schnelle [Thu, 8 Jul 2021 12:55:42 +0000 (14:55 +0200)]
s390: make PCI mio support a machine flag
Kernel support for the newer PCI mio instructions can be toggled off
with the pci=nomio command line option which needs to integrate with
common code PCI option parsing. However this option then toggles static
branches which can't be toggled yet in an early_param() call.
Thus commit 9964f396f1d0 ("s390: fix setting of mio addressing control")
moved toggling the static branches to the PCI init routine.
With this setup however we can't check for mio support outside the PCI
code during early boot, i.e. before switching the static branches, which
we need to be able to export this as an ELF HWCAP.
Improve on this by turning mio availability into a machine flag that
gets initially set based on CONFIG_PCI and the facility bit and gets
toggled off if pci=nomio is found during PCI option parsing allowing
simple access to this machine flag after early init.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
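The flag lifecycle described in the commit above can be sketched in plain C: set once during early init, possibly cleared during option parsing, then read by later code. All names are hypothetical; the real machine flag and parsing code are not shown in this log.

```c
#include <stdbool.h>
#include <string.h>

static bool sketch_mio_flag;	/* hypothetical stand-in for the machine flag */

/* Early init: derive the flag from build configuration and facility bit. */
static void sketch_early_init(bool pci_configured, bool facility_present)
{
	sketch_mio_flag = pci_configured && facility_present;
}

/* Option parsing: "nomio" clears the flag, other options are ignored. */
static void sketch_parse_pci_option(const char *opt)
{
	if (strcmp(opt, "nomio") == 0)
		sketch_mio_flag = false;
}

/* Later code, such as hwcap setup, only reads the flag. */
static bool sketch_have_mio(void)
{
	return sketch_mio_flag;
}
```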
Heiko Carstens [Mon, 15 Feb 2021 19:57:53 +0000 (20:57 +0100)]
s390/disassembler: add instructions
Add more instructions to the kernel disassembler.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Heiko Carstens [Wed, 13 Jan 2021 13:14:01 +0000 (14:14 +0100)]
s390: report more CPU capabilities
Add hardware capability bits and feature tags to /proc/cpuinfo
for NNPA and Vector-Packed-Decimal-Enhancement Facility 2.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Julian Wiedmann [Mon, 22 Feb 2021 09:18:33 +0000 (10:18 +0100)]
s390/qdio: remove unused macros
These macros haven't seen any use in a long time. Also note that the
queue_irqs_*() ones wouldn't even compile anymore.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Julian Wiedmann [Wed, 14 Jul 2021 16:03:51 +0000 (18:03 +0200)]
s390/qdio: clarify reporting of errors to the drivers
Now that all drivers use qdio_inspect_queue() and qdio's internal
queue tasklets are gone, the driver-specified queue handlers are
only called for async error reporting (eg. for an error condition in
the QEBSM code).
So take a moment to clean up the Output Queue handlers (they are
_always_ called with qdio_error != 0), and clarify which error types
can be reported through what interface. As Benjamin already suggested
a while ago, we should turn these into distinct enums at some point.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Julian Wiedmann [Mon, 12 Jul 2021 06:29:32 +0000 (08:29 +0200)]
s390/qdio: remove unneeded siga-sync for Output Queue
get_outbound_buffer_frontier() is only reached via qdio_inspect_queue(),
and there we already call qdio_siga_sync_q() unconditionally.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Julian Wiedmann [Mon, 15 Mar 2021 18:39:20 +0000 (19:39 +0100)]
s390/qdio: remove remaining tasklet & timer code
Both qdio drivers have moved away from using qdio's internal tasklet
and timer mechanisms for Output Queues. Rip out all the leftovers.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Julian Wiedmann [Tue, 1 Jun 2021 06:20:09 +0000 (08:20 +0200)]
s390/qdio: propagate error when cancelling a ccw fails
If qdio_cancel_ccw() times out (or is interrupted) before the interrupt
for the {halt,clear} action arrives, report this back to the caller as
an error.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Julian Wiedmann [Mon, 31 May 2021 15:38:04 +0000 (18:38 +0300)]
s390/qdio: improve roll-back after error on ESTABLISH ccw
If the ESTABLISH ccw fails (ie. the qdio_irq is set to
QDIO_IRQ_STATE_ERR), we don't need to call qdio_shutdown() for rolling
back our earlier actions. All the needed logic is already available in
qdio_establish()'s error chain, and using it means we don't have to
temporarily drop the setup_mutex either.
This makes qdio_shutdown() a purely external function, that should only
be called by the driver if an earlier qdio_establish() succeeded.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Julian Wiedmann [Mon, 31 May 2021 15:33:02 +0000 (18:33 +0300)]
s390/qdio: cancel the ESTABLISH ccw after timeout
When the ESTABLISH ccw does not complete within the specified timeout,
try our best to cancel the ccw program that is still active on the
device. Otherwise the IO subsystem might be accessing it even after
the driver eg. called qdio_free().
Fixes: 779e6e1c724d ("[S390] qdio: new qdio driver.")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Cc: <stable@vger.kernel.org> # 2.6.27
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Julian Wiedmann [Mon, 31 May 2021 15:40:06 +0000 (18:40 +0300)]
s390/qdio: fix roll-back after timeout on ESTABLISH ccw
When qdio_establish() times out while waiting for the ESTABLISH ccw to
complete, it calls qdio_shutdown() to roll back all of its previous
actions. But at this point the qdio_irq's state is still
QDIO_IRQ_STATE_INACTIVE, so qdio_shutdown() will exit immediately
without doing any actual work.
Which means that eg. the qdio_irq's thinint-indicator stays registered,
and cdev->handler isn't restored to its old value. And since
commit 954d6235be41 ("s390/qdio: make thinint registration symmetric")
the qdio_irq also stays on the tiq_list, so on the next qdio_establish()
we might get a helpful BUG from the list-debugging code:
...
[ 4633.512591] list_add double add: new=00000000005a4110, prev=00000001b357db78, next=00000000005a4110.
[ 4633.512621] ------------[ cut here ]------------
[ 4633.512623] kernel BUG at lib/list_debug.c:29!
...
[ 4633.512796] [<00000001b2c6ee9a>] __list_add_valid+0x82/0xa0
[ 4633.512798] ([<00000001b2c6ee96>] __list_add_valid+0x7e/0xa0)
[ 4633.512800] [<00000001b2fcecce>] qdio_establish_thinint+0x116/0x190
[ 4633.512805] [<00000001b2fcbe58>] qdio_establish+0x128/0x498
...
Fix this by extracting a goto-chain from the existing error exits in
qdio_establish(), and check the return value of the wait_event_...()
to detect the timeout condition.
Fixes: 779e6e1c724d ("[S390] qdio: new qdio driver.")
Root-caused-by: Benjamin Block <bblock@linux.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Cc: <stable@vger.kernel.org> # 2.6.27
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
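The goto-chain pattern mentioned in the qdio fix above, as a generic sketch. This is not the qdio_establish() code: the state struct, the wait helper and the error labels are hypothetical. The wait helper follows the common convention of returning a positive value on success, 0 on timeout and a negative value when interrupted.

```c
#include <errno.h>

struct sketch_state {
	int indicator_registered;
	int handler_installed;
};

/* Hypothetical wait: simply hands back the result it was given. */
static long sketch_wait(long result)
{
	return result;
}

static int sketch_establish(struct sketch_state *s, int fail_start,
			    long wait_result)
{
	long left;
	int rc;

	s->handler_installed = 1;
	s->indicator_registered = 1;

	if (fail_start) {
		rc = -EIO;
		goto err_start;
	}

	/* Check the wait's return value to detect timeout or interruption. */
	left = sketch_wait(wait_result);
	if (left <= 0) {
		rc = left ? (int)left : -ETIME;
		goto err_wait;
	}
	return 0;

err_wait:
	/* a real driver would cancel the in-flight I/O here */
err_start:
	/* undo the earlier setup steps in reverse order */
	s->indicator_registered = 0;
	s->handler_installed = 0;
	return rc;
}
```

Because every failure exit falls through the same labels, the roll-back no longer depends on a separate shutdown routine agreeing about the current state.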
Alexander Egorenkov [Wed, 16 Jun 2021 12:10:03 +0000 (14:10 +0200)]
s390/setup: don't reserve memory that occupied decompressor's head
There is no useful information within [STARTUP_NORMAL_OFFSET, HEAD_END] now.
But the memory region [0, STARTUP_NORMAL_OFFSET] is used by:
* lowcore
* kdump for swapping memory
* stand-alone zipl dumpers for code, data, stack and heap
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Alexander Egorenkov [Tue, 15 Jun 2021 17:17:36 +0000 (19:17 +0200)]
s390/boot: move dma sections from decompressor to decompressed kernel
This change simplifies the task of making the decompressor relocatable.
The decompressor's image contains special DMA sections between _sdma and
_edma. This DMA segment is loaded at boot as part of the decompressor and
then simply handed over to the decompressed kernel. The decompressor itself
never uses it in any way. The primary reason for this is the need to keep
the aforementioned DMA segment below 2GB which is required by architecture,
and because the decompressor is always loaded at a fixed low physical
address, it is guaranteed that the DMA region will not cross the 2GB
memory limit. If the DMA region had been placed in the decompressed kernel,
then KASLR would make this guarantee impossible to fulfill or it would
be restricted to the first 2GB of memory address space.
This commit moves all DMA sections between _sdma and _edma from
the decompressor's image to the decompressed kernel's image. The complete
DMA region is placed in the init section of the decompressed kernel and
immediately relocated below 2GB at start-up before it is needed by other
parts of the decompressed kernel. The relocation of the DMA region happens
even if the decompressed kernel is already located below 2GB in order
to keep the first implementation simple. The relocation should not have
any noticeable impact on boot time because the DMA segment is only a couple
of pages.
After relocating the DMA sections, the kernel has to fix all references
which point into it. In order to automate this, place all variables
pointing into the DMA sections in a special .dma.refs section. All such
variables must be defined using the new __dma_ref macro. Only variables
containing addresses within the DMA sections must be placed in the new
.dma.refs section.
Furthermore, move the initialization of control registers from
the decompressor to the decompressed kernel because some control registers
reference tables that must be placed in the DMA data section to
guarantee that their addresses are below 2G. Because the decompressed
kernel relocates the DMA sections at startup, the content of control
registers CR2, CR5 and CR15 must be updated with new addresses after
the relocation. The decompressed kernel initializes all control registers
early at boot and then updates the content of CR2, CR5 and CR15
as soon as the DMA relocation has occurred. This practically reverts
the commit a80313ff91ab ("s390/kernel: introduce .dma sections").
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
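The reference fix-up step described in the DMA commit above can be sketched in userspace. In the kernel the list of references is collected by the linker through the .dma.refs section; here an explicit array of pointers stands in for that section, and all names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Copy a region to its new location, then rebase every registered
 * variable that holds an address inside the old region. */
static void sketch_relocate(unsigned char *dst, const unsigned char *src,
			    size_t size, uintptr_t *refs[], size_t nrefs)
{
	size_t i;

	memcpy(dst, src, size);
	for (i = 0; i < nrefs; i++) {
		uintptr_t off = *refs[i] - (uintptr_t)src;

		*refs[i] = (uintptr_t)dst + off;
	}
}
```

Only variables that really point into the region may be registered; rebasing an unrelated value would corrupt it, which matches the restriction stated in the commit message.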
Heiko Carstens [Sun, 25 Jul 2021 12:12:58 +0000 (14:12 +0200)]
s390/ctl_reg: add ctlreg5 and ctlreg15 unions
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Alexander Egorenkov [Tue, 15 Jun 2021 14:05:09 +0000 (16:05 +0200)]
s390/boot: make _diag308_reset_dma() position-independent
As a preparation for moving the .dma.data section from the decompressor to
the decompressed kernel, the .dma.data section must be made relocatable
by replacing absolute memory addressing with relative one. This is required
in order to be able to relocate the DMA section to a memory address <= 2G
as required by the hardware architecture. The DMA section must be
relocated in case the decompressed kernel was loaded to an address >= 2G
which can occur if KASAN is enabled. By making the whole DMA section
position-independent we avoid applying relocations to it whenever it is
moved to a different address, which becomes possible as soon as it becomes
a part of the decompressed kernel.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Alexander Egorenkov [Wed, 21 Jul 2021 14:47:20 +0000 (16:47 +0200)]
s390/boot: move EP_OFFSET and EP_STRING to head.S
Both macros are used only in the decompressor's head.S, so there is no need
to keep them in a widely included global header such as setup.h.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Alexander Egorenkov [Wed, 21 Jul 2021 10:27:59 +0000 (12:27 +0200)]
s390/setup: generate asm offsets from struct parmarea
To reduce duplication, replace error-prone and hard-coded parameter area
offsets with auto-generated ones.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
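The idea of deriving offsets from a struct instead of hard-coding them can be sketched with offsetof(). The layout below is hypothetical; the real struct parmarea fields are not shown in this log, and the kernel generates such constants for assembler code through its asm-offsets mechanism rather than with plain macros.

```c
#include <stddef.h>

/* Hypothetical parameter area layout. */
struct sketch_parmarea {
	unsigned long ipl_device;
	unsigned long initrd_start;
	unsigned long initrd_size;
	char command_line[64];
};

/* Offsets follow the struct automatically if fields move or change. */
#define SKETCH_IPL_DEVICE_OFFSET \
	offsetof(struct sketch_parmarea, ipl_device)
#define SKETCH_INITRD_START_OFFSET \
	offsetof(struct sketch_parmarea, initrd_start)
#define SKETCH_COMMAND_LINE_OFFSET \
	offsetof(struct sketch_parmarea, command_line)
```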
Alexander Egorenkov [Wed, 21 Jul 2021 09:54:33 +0000 (11:54 +0200)]
s390/setup: drop _OFFSET macros
The macros
* IPL_DEVICE_OFFSET
* INITRD_START_OFFSET
* INITRD_SIZE_OFFSET
* OLDMEM_BASE_OFFSET
* OLDMEM_SIZE_OFFSET
* KERNEL_VERSION_OFFSET
* COMMAND_LINE_OFFSET
are no longer necessary and used only to define another set of macros
with the same names but w/o the suffix _OFFSET. Therefore, drop this
unnecessary indirection.
Drop the macro KERNEL_VERSION_OFFSET w/o renaming it to KERNEL_VERSION
because it is used nowhere.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Alexander Egorenkov [Tue, 15 Jun 2021 10:57:31 +0000 (12:57 +0200)]
s390/setup: remove unused symbolic constants for C code from setup.h
These symbolic constants are used only by assembler code now:
* COMMAND_LINE
* IPL_DEVICE
C code of the decompressed kernel should use boot data passed
by the decompressor instead.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Alexander Egorenkov [Tue, 15 Jun 2021 12:25:41 +0000 (14:25 +0200)]
s390/dump: introduce boot data 'oldmem_data'
The new boot data struct shall replace global variables OLDMEM_BASE and
OLDMEM_SIZE. It is initialized in the decompressor and passed
to the decompressed kernel. In comparison to the old solution, this one
doesn't access data at fixed physical addresses which will become important
when the decompressor becomes relocatable.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Alexander Egorenkov [Tue, 15 Jun 2021 12:15:07 +0000 (14:15 +0200)]
s390/boot: introduce boot data 'initrd_data'
The new boot data struct shall replace global variables INITRD_START and
INITRD_SIZE. It is initialized in the decompressor and passed
to the decompressed kernel. In comparison to the old solution, this one
doesn't access data at fixed physical addresses which will become important
when the decompressor becomes relocatable.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Alexander Egorenkov [Thu, 5 Nov 2020 12:09:06 +0000 (13:09 +0100)]
s390/boot: move sclp early buffer from fixed address in asm to C
To make the decompressor relocatable, the early SCLP buffer with a fixed
address must be replaced with a relocatable C buffer of the according size
and alignment as required by SCLP.
Introduce a new function sclp_early_set_buffer() into the SCLP driver
which enables the decompressor to change the SCLP early buffer at any time.
This will be useful when the decompressor becomes fully relocatable and
might need to change the SCLP early buffer to one with an address < 2G
as required by SCLP because it was loaded at an address >= 2G.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Alexander Egorenkov [Tue, 15 Jun 2021 13:59:32 +0000 (15:59 +0200)]
s390/boot: get rid of magic numbers for startup offsets
Use STARTUP_NORMAL_OFFSET and STARTUP_KDUMP_OFFSET instead of magic numbers.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Heiko Carstens [Tue, 13 Jul 2021 14:21:24 +0000 (16:21 +0200)]
s390/vdso: use system call functions
Use system call functions instead of open-coding svc inline
assemblies. This is mostly to get rid of even more register asm
constructs.
Besides that, it makes the code also a bit easier to understand.
The generated code is identical to what it was before.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Heiko Carstens [Tue, 13 Jul 2021 14:21:07 +0000 (16:21 +0200)]
s390/syscall: provide generic system call functions
Provide generic system call functions which should be used whenever a
system call needs to be done from user space. The only in-kernel code
is vdso, which will be converted with a follow on patch.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Heiko Carstens [Fri, 25 Jun 2021 14:42:55 +0000 (16:42 +0200)]
s390/cpacf: get rid of register asm
Using register asm statements has been proven to be very error prone,
especially when using code instrumentation where gcc may add function
calls, which clobbers register contents in an unexpected way.
Therefore get rid of register asm statements in cpacf code, and make
sure this bug class cannot happen.
Reviewed-by: Patrick Steuer <patrick.steuer@de.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Heiko Carstens [Tue, 13 Jul 2021 19:09:58 +0000 (21:09 +0200)]
s390/debug: remove unused print defines
Remove unused print defines from debug feature header file.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Heiko Carstens [Tue, 13 Jul 2021 19:07:47 +0000 (21:07 +0200)]
s390/dasd: remove debug printk
Remove dasd ioctl debug printk which seems to be a leftover from the
very early days. At least it seems to be quite pointless.
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Alexander Egorenkov [Fri, 25 Jun 2021 07:51:15 +0000 (09:51 +0200)]
s390/uv: de-duplicate checks for Protected Host Virtualization
De-duplicate checks for Protected Host Virtualization in decompressor and
kernel.
Set prot_virt_host=0 in the decompressor in *any* of the following cases
and hand it over to the decompressed kernel:
* No explicit prot_virt=1 is given on the kernel command-line
* Protected Guest Virtualization is enabled
* Hardware support not present
* kdump or stand-alone dump
The decompressed kernel needs to use only is_prot_virt_host() instead of
performing again all checks done by the decompressor.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
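The four conditions listed in the uv commit above reduce to a single boolean expression. The function and parameter names below are hypothetical and only restate the list.

```c
#include <stdbool.h>

/* Returns false (prot_virt_host=0) if any listed case applies:
 * not requested on the command line, running as a protected guest,
 * missing hardware support, or a kdump/stand-alone dump boot. */
static bool sketch_prot_virt_host(bool requested, bool is_guest,
				  bool hw_support, bool dump_mode)
{
	return requested && !is_guest && hw_support && !dump_mode;
}
```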
Alexander Egorenkov [Mon, 5 Jul 2021 17:37:25 +0000 (19:37 +0200)]
s390/boot: disable Secure Execution in dump mode
A dump kernel is neither required nor able to support Secure Execution.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Alexander Egorenkov [Mon, 5 Jul 2021 17:33:27 +0000 (19:33 +0200)]
s390/boot: move uv function declarations to boot/uv.h
The functions adjust_to_uv_max() and uv_query_info() are used only
in the decompressor. Therefore, move the function declarations from
the global arch/s390/include/asm/uv.h to arch/s390/boot/uv.h.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Heiko Carstens [Mon, 12 Jul 2021 17:26:01 +0000 (19:26 +0200)]
s390/jump_label: print real address in a case of a jump label bug
In case of a jump label bug, print the real address of the piece of code
where a mismatch was detected. This is right before the system panics,
so there is nothing revealed.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Heiko Carstens [Mon, 12 Jul 2021 17:19:03 +0000 (19:19 +0200)]
s390/mm: don't print hashed values for pte_ERROR() & friends
Print the real pte, pmd, etc. values instead of some hashed
value. Otherwise debugging would be even more difficult.
This also matches what most other architectures are doing.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Heiko Carstens [Mon, 12 Jul 2021 17:33:09 +0000 (19:33 +0200)]
s390/mm: use pr_err() instead of printk() for pte_ERROR & friends
Use pr_err() to use a proper printk level.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Alexander Egorenkov [Sat, 10 Oct 2020 01:34:25 +0000 (03:34 +0200)]
s390/sclp: use only one sclp early buffer to send commands
A buffer that can be used for communication with SCLP is required
to lie below 2GB memory address. Therefore, both sclp_info_sccb
and sclp_early_sccb must fulfill this requirement if passed directly
to the sclp_early_cmd() function. Instead, use only sclp_early_sccb
for communication with SCLP. This allows the buffer sclp_info_sccb
to be placed anywhere in the memory address space and, therefore,
simplifies the process of making the decompressor relocatable later on,
one thing less to relocate. And make sure that the length of the new unified
early SCLP buffer is no less than the length of the removed sclp_info_sccb
buffer which might be larger than the length of the sclp_early_sccb buffer.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Alexander Egorenkov [Tue, 15 Jun 2021 07:20:26 +0000 (09:20 +0200)]
s390/cio: remove unused include linux/spinlock.h from cio.h
* The linux/spinlock.h header was included indirectly by the decompressor
and brought unnecessary build dependencies.
* Use proper includes in files which either directly or indirectly included
cio.h and were hidden until now by the included linux/spinlock.h, e.g.
linux/string.h for memcpy() or asm/page.h for PAGE_SIZE.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Alexander Egorenkov [Wed, 30 Jun 2021 15:17:53 +0000 (17:17 +0200)]
s390/boot: make stacks part of the decompressor's image
Instead of using constant addresses for the normal and dump-info stacks,
allocate both stacks in the decompressor's image and load the stack register
in a position-independent manner.
This will allow loading and entering the decompressor at an arbitrary
memory address without corrupting the content at the fixed addresses
used until now for both stacks. This is one of the prerequisites
for being able to kexec the decompressor from its load address without
relocating it first.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Alexander Egorenkov [Wed, 30 Jun 2021 15:12:25 +0000 (17:12 +0200)]
s390/boot: move all linker symbol declarations from c to h files
To prevent multiple incompatible declarations of symbols and to catch
such mistakes at compile time.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Linus Torvalds [Sun, 25 Jul 2021 22:35:14 +0000 (15:35 -0700)]
Linux 5.14-rc3
Linus Torvalds [Sun, 25 Jul 2021 18:06:37 +0000 (11:06 -0700)]
smpboot: fix duplicate and misplaced inlining directive
gcc doesn't care, but clang quite reasonably pointed out that the recent
commit e9ba16e68cce ("smpboot: Mark idle_init() as __always_inlined to
work around aggressive compiler un-inlining") did some really odd
things:
kernel/smpboot.c:50:20: warning: duplicate 'inline' declaration specifier [-Wduplicate-decl-specifier]
static inline void __always_inline idle_init(unsigned int cpu)
^
which not only has that duplicate inlining specifier, but the new
__always_inline was put in the wrong place of the function definition.
We put the storage class specifiers (ie things like "static" and
"extern") first, and the type information after that. And while the
compiler may not care, we put the inline specifier before the types.
So it should be just
static __always_inline void idle_init(unsigned int cpu)
instead.
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 25 Jul 2021 17:33:48 +0000 (10:33 -0700)]
Merge tag 'powerpc-5.14-3' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix guest to host memory corruption in H_RTAS due to missing nargs
check.
- Fix guest triggerable host crashes due to bad handling of nested
guest TM state.
- Fix possible crashes due to incorrect reference counting in
kvm_arch_vcpu_ioctl().
- Two commits fixing some regressions in KVM transactional memory
handling introduced by the recent rework of the KVM code.
Thanks to Nicholas Piggin, Alexey Kardashevskiy, and Michael Neuling.
* tag 'powerpc-5.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
KVM: PPC: Book3S HV Nested: Sanitise H_ENTER_NESTED TM state
KVM: PPC: Book3S: Fix H_RTAS rets buffer overflow
KVM: PPC: Fix kvm_arch_vcpu_ioctl vcpu_load leak
KVM: PPC: Book3S: Fix CONFIG_TRANSACTIONAL_MEM=n crash
KVM: PPC: Book3S HV P9: Fix guest TM support
Linus Torvalds [Sun, 25 Jul 2021 17:27:44 +0000 (10:27 -0700)]
Merge tag 'timers-urgent-2021-07-25' of git://git./linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"A small set of timer related fixes:
- Plug a race between rearm and process tick in the posix CPU timers
code
- Make the optimization to avoid recalculation of the next timer
interrupt work correctly when there are no timers pending"
* tag 'timers-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Fix get_next_timer_interrupt() with no timers pending
posix-cpu-timers: Fix rearm racing against process tick
Linus Torvalds [Sun, 25 Jul 2021 17:21:19 +0000 (10:21 -0700)]
Merge tag 'locking-urgent-2021-07-25' of git://git./linux/kernel/git/tip/tip
Pull x86 jump label fix from Thomas Gleixner:
"A single fix for jump labels to prevent the compiler from aggressive
un-inlining which results in a section mismatch"
* tag 'locking-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
jump_labels: Mark __jump_label_transform() as __always_inlined to work around aggressive compiler un-inlining
Linus Torvalds [Sun, 25 Jul 2021 17:04:27 +0000 (10:04 -0700)]
Merge tag 'efi-urgent-2021-07-25' of git://git./linux/kernel/git/tip/tip
Pull EFI fixes from Thomas Gleixner:
"A set of EFI fixes:
- Prevent memblock and I/O reserved resources from getting out of sync when
EFI memreserve is in use.
- Don't claim a non-existing table is invalid
- Don't warn when firmware memory is already reserved correctly"
* tag 'efi-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi/mokvar: Reserve the table only if it is in boot services data
efi/libstub: Fix the efi_load_initrd function description
firmware/efi: Tell memblock about EFI iomem reservations
efi/tpm: Differentiate missing and invalid final event log table.
Linus Torvalds [Sun, 25 Jul 2021 16:52:48 +0000 (09:52 -0700)]
Merge tag 'core-urgent-2021-07-25' of git://git./linux/kernel/git/tip/tip
Pull core fix from Thomas Gleixner:
"A single update for the boot code to prevent aggressive un-inlining
which causes a section mismatch"
* tag 'core-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smpboot: Mark idle_init() as __always_inlined to work around aggressive compiler un-inlining
Linus Torvalds [Sun, 25 Jul 2021 16:46:17 +0000 (09:46 -0700)]
Merge tag 'dma-mapping-5.14-1' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fix from Christoph Hellwig:
- handle vmalloc addresses in dma_common_{mmap,get_sgtable} (Roman
Skakun)
* tag 'dma-mapping-5.14-1' of git://git.infradead.org/users/hch/dma-mapping:
dma-mapping: handle vmalloc addresses in dma_common_{mmap,get_sgtable}
Linus Torvalds [Sun, 25 Jul 2021 00:26:47 +0000 (17:26 -0700)]
Merge tag '5.14-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Five cifs/smb3 fixes, including a DFS failover fix, two fallocate
fixes, and two trivial coverity cleanups"
* tag '5.14-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix fallocate when trying to allocate a hole.
CIFS: Clarify SMB1 code for POSIX delete file
CIFS: Clarify SMB1 code for POSIX Create
cifs: support share failover when remounting
cifs: only write 64kb at a time when fallocating a small region of a file
Linus Torvalds [Sat, 24 Jul 2021 22:34:04 +0000 (15:34 -0700)]
Merge tag 'riscv-for-linus-5.14-rc3' of git://git./linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- properly set the memory size, which fixes 32-bit systems
- allow initrd to load anywhere in memory, rather than restricting it
to the first 256MiB
- fix the 'mem=' parameter on 64-bit systems to properly account for
the maximum supported memory now that the kernel is outside the
linear map
- avoid installing mappings into the last 4KiB of memory, which
conflicts with error values
- prevent the stack from being freed while it is being walked
- a handful of fixes to the new copy to/from user routines
* tag 'riscv-for-linus-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: __asm_copy_to-from_user: Fix: Typos in comments
riscv: __asm_copy_to-from_user: Remove unnecessary size check
riscv: __asm_copy_to-from_user: Fix: fail on RV32
riscv: __asm_copy_to-from_user: Fix: overrun copy
riscv: stacktrace: pin the task's stack in get_wchan
riscv: Make sure the kernel mapping does not overlap with IS_ERR_VALUE
riscv: Make sure the linear mapping does not use the kernel mapping
riscv: Fix memory_limit for 64-bit kernel
RISC-V: load initrd wherever it fits into memory
riscv: Fix 32-bit RISC-V boot failure
Linus Torvalds [Sat, 24 Jul 2021 22:25:54 +0000 (15:25 -0700)]
ACPI: fix NULL pointer dereference
Commit 71f642833284 ("ACPI: utils: Fix reference counting in
for_each_acpi_dev_match()") started doing "acpi_dev_put()" on a pointer
that was possibly NULL. That fails miserably, because that helper
inline function is not set up to handle that case.
Just make acpi_dev_put() silently accept a NULL pointer, rather than
calling down to put_device() with an invalid offset off that NULL
pointer.
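A minimal userspace sketch of the NULL-tolerant helper described above. The struct layouts and the put_device() stub are simplified stand-ins invented for illustration, not the kernel definitions; only the shape of the NULL check is meant to mirror the fix.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel types (assumption, illustration only). */
struct device {
	int refcount;
};

struct acpi_device {
	int handle;		/* keeps 'dev' at a non-zero offset */
	struct device dev;
};

/* Stub: drops one reference; the real put_device() does much more. */
static void put_device(struct device *dev)
{
	if (dev)
		dev->refcount--;
}

/*
 * Without the check, &adev->dev for a NULL adev is a small non-NULL
 * address (the offset of 'dev'), so the NULL test inside put_device()
 * cannot catch it. Checking adev itself avoids that.
 */
static inline void acpi_dev_put(struct acpi_device *adev)
{
	if (adev)
		put_device(&adev->dev);
}
```

With this shape, a caller can pass the possibly-NULL result of an iteration step straight to acpi_dev_put() without a separate guard.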
Link: https://lore.kernel.org/lkml/a607c149-6bf6-0fd0-0e31-100378504da2@kernel.dk/
Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Daniel Scally <djrscally@gmail.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 24 Jul 2021 20:08:31 +0000 (13:08 -0700)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Four fixes, all in drivers, all of which can lead to user visible
problems in certain situations"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: target: Fix NULL dereference on XCOPY completion
scsi: mpt3sas: Transition IOC to Ready state during shutdown
scsi: target: Fix protect handling in WRITE SAME(32)
scsi: iscsi: Fix iface sysfs attr detection
Linus Torvalds [Sat, 24 Jul 2021 20:03:40 +0000 (13:03 -0700)]
Merge tag 'io_uring-5.14-2021-07-24' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- Fix a memory leak due to a race condition in io_init_wq_offload
(Yang)
- Poll error handling fixes (Pavel)
- Fix early fdput() regression (me)
- Don't reissue iopoll requests off release path (me)
- Add a safety check for io-wq queue off wrong path (me)
* tag 'io_uring-5.14-2021-07-24' of git://git.kernel.dk/linux-block:
io_uring: explicitly catch any illegal async queue attempt
io_uring: never attempt iopoll reissue from release path
io_uring: fix early fdput() of file
io_uring: fix memleak in io_init_wq_offload()
io_uring: remove double poll entry on arm failure
io_uring: explicitly count entries for poll reqs
Linus Torvalds [Sat, 24 Jul 2021 19:57:06 +0000 (12:57 -0700)]
Merge tag 'block-5.14-2021-07-24' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- NVMe pull request (Christoph):
- tracing fix (Keith Busch)
- fix multipath head refcounting (Hannes Reinecke)
- Write Zeroes vs PI fix (me)
- drop a bogus WARN_ON (Zhihao Cheng)
- Increase max blk-cgroup policy size, now that mq-deadline
uses it too (Oleksandr)
* tag 'block-5.14-2021-07-24' of git://git.kernel.dk/linux-block:
nvme: set the PRACT bit when using Write Zeroes with T10 PI
nvme: fix nvme_setup_command metadata trace event
nvme: fix refcounting imbalance when all paths are down
nvme-pci: don't WARN_ON in nvme_reset_work if ctrl.state is not RESETTING
block: increase BLKCG_MAX_POLS
Linus Torvalds [Sat, 24 Jul 2021 19:55:06 +0000 (12:55 -0700)]
Merge branch 'i2c/for-current' of git://git./linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"Two bugfixes for the I2C subsystem"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: mpc: Poll for MCF
misc: eeprom: at24: Always append device id even if label property is set.
Linus Torvalds [Sat, 24 Jul 2021 19:27:16 +0000 (12:27 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc mm fixes from Andrew Morton:
"15 patches.
VM subsystems affected by this patch series: userfaultfd, kfence,
highmem, pagealloc, memblock, pagecache, secretmem, pagemap, and
hugetlbfs"
* akpm:
hugetlbfs: fix mount mode command line processing
mm: fix the deadlock in finish_fault()
mm: mmap_lock: fix disabling preemption directly
mm/secretmem: wire up ->set_page_dirty
writeback, cgroup: do not reparent dax inodes
writeback, cgroup: remove wb from offline list before releasing refcnt
memblock: make for_each_mem_range() traverse MEMBLOCK_HOTPLUG regions
mm: page_alloc: fix page_poison=1 / INIT_ON_ALLOC_DEFAULT_ON interaction
mm: use kmap_local_page in memzero_page
mm: call flush_dcache_page() in memcpy_to_page() and memzero_page()
kfence: skip all GFP_ZONEMASK allocations
kfence: move the size check to the beginning of __kfence_alloc()
kfence: defer kfence_test_init to ensure that kunit debugfs is created
selftest: use mmap instead of posix_memalign to allocate memory
userfaultfd: do not untag user pointers
Akira Tsukamoto [Tue, 20 Jul 2021 08:53:23 +0000 (17:53 +0900)]
riscv: __asm_copy_to-from_user: Fix: Typos in comments
Fixing typos and grammar mistakes and using more intuitive label
name.
Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
Fixes: ca6eaaa210de ("riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall")
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Akira Tsukamoto [Tue, 20 Jul 2021 08:52:36 +0000 (17:52 +0900)]
riscv: __asm_copy_to-from_user: Remove unnecessary size check
Clean up:
The size of 0 will be evaluated in the next step. Not
required here.
Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
Fixes: ca6eaaa210de ("riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall")
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Akira Tsukamoto [Tue, 20 Jul 2021 08:51:45 +0000 (17:51 +0900)]
riscv: __asm_copy_to-from_user: Fix: fail on RV32
There was a bug in the byte-to-bit conversion when the cpu was rv32.
a3 contains the number of bytes, and multiplying it by 8 gives the
number of bits. LGREG holds 2 for RV32 and 3 for RV64, so to multiply
by 8 the shift amount must always be the constant 3.
The 2 was mistakenly used for rv32.
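The arithmetic can be sketched in C as follows. The function names are hypothetical and this is not the assembly from the patch; it only illustrates why the shift count has to be 3 regardless of the register width.

```c
/* Hypothetical helper names, for illustration only. */

/* Correct: bytes * 8 == bytes << 3, independent of the register width. */
static unsigned long bytes_to_bits(unsigned long nbytes)
{
	return nbytes << 3;
}

/*
 * Buggy variant: shifting by log2 of the register size in bytes.
 * With lgreg == 2 this multiplies by 4 instead of 8.
 */
static unsigned long bytes_to_bits_lgreg(unsigned long nbytes, unsigned int lgreg)
{
	return nbytes << lgreg;
}
```

For a 3-byte offset, the correct helper gives 24 bits, while the LGREG-based variant gives only 12 when the shift count is 2.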
Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
Fixes: ca6eaaa210de ("riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall")
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Akira Tsukamoto [Tue, 20 Jul 2021 08:50:52 +0000 (17:50 +0900)]
riscv: __asm_copy_to-from_user: Fix: overrun copy
There were two causes for the overrun memory access.
First, the threshold size was too small.
Aligning dst requires one SZREG and the unrolled word copy requires
8*SZREG, so the total has to be at least 9*SZREG.
Second, inside the unrolled copy, using -(8*SZREG-1) as the adjustment
made the loop run one extra iteration. The proper value is -(8*SZREG).
Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
Fixes: ca6eaaa210de ("riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall")
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Mike Kravetz [Fri, 23 Jul 2021 22:50:44 +0000 (15:50 -0700)]
hugetlbfs: fix mount mode command line processing
In commit 32021982a324 ("hugetlbfs: Convert to fs_context") processing
of the mount mode string was changed from match_octal() to fsparam_u32.
This changed existing behavior as match_octal does not require octal
values to have a '0' prefix, but fsparam_u32 does.
Use fsparam_u32oct which provides the same behavior as match_octal.
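The difference in behavior can be modeled in userspace with strtoul(). This is only an analogy: parse_mode_base8() and parse_mode_autodetect() are hypothetical names, with base 8 standing in for match_octal()/fsparam_u32oct and base 0 (prefix auto-detection) standing in for fsparam_u32, on the assumption that the latter selects octal only when a leading '0' is present.

```c
#include <stdlib.h>

/* Always interpret the string as octal, with or without a '0' prefix. */
static unsigned long parse_mode_base8(const char *s)
{
	return strtoul(s, NULL, 8);
}

/* Auto-detect the base: only a leading '0' selects octal. */
static unsigned long parse_mode_autodetect(const char *s)
{
	return strtoul(s, NULL, 0);
}
```

For the string "755", the base-8 parser yields 0755, while the auto-detecting parser yields decimal 755, which is 01363 in octal and not the mode the user asked for.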
Link: https://lkml.kernel.org/r/20210721183326.102716-1-mike.kravetz@oracle.com
Fixes: 32021982a324 ("hugetlbfs: Convert to fs_context")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Dennis Camera <bugs+kernel.org@dtnr.ch>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qi Zheng [Fri, 23 Jul 2021 22:50:41 +0000 (15:50 -0700)]
mm: fix the deadlock in finish_fault()
Commit 63f3655f9501 ("mm, memcg: fix reclaim deadlock with writeback")
fixed the following ABBA deadlock by pre-allocating the pte page table
without holding the page lock.
lock_page(A)
SetPageWriteback(A)
unlock_page(A)
lock_page(B)
                                lock_page(B)
pte_alloc_one
  shrink_page_list
    wait_on_page_writeback(A)
                                SetPageWriteback(B)
                                unlock_page(B)
                                # flush A, B to clear the writeback
Commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault()
codepaths") reworked the relevant code but ignored this race. This will
cause the deadlock above to appear again, so fix it.
Link: https://lkml.kernel.org/r/20210721074849.57004-1-zhengqi.arch@bytedance.com
Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Muchun Song [Fri, 23 Jul 2021 22:50:38 +0000 (15:50 -0700)]
mm: mmap_lock: fix disabling preemption directly
Commit 832b50725373 ("mm: mmap_lock: use local locks instead of
disabling preemption") fixed a bug by using local locks.
But commit d01079f3d0c0 ("mm/mmap_lock: remove dead code for
!CONFIG_TRACING configurations") changed those lines back to the
original version.
I guess the regression was introduced while resolving conflicts.
Link: https://lkml.kernel.org/r/20210720074228.76342-1-songmuchun@bytedance.com
Fixes: d01079f3d0c0 ("mm/mmap_lock: remove dead code for !CONFIG_TRACING configurations")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Fri, 23 Jul 2021 22:50:35 +0000 (15:50 -0700)]
mm/secretmem: wire up ->set_page_dirty
Make secretmem up to date with the changes done in commit 0af573780b0b
("mm: require ->set_page_dirty to be explicitly wired up") so that
an unconditional call to this method won't cause crashes.
Link: https://lkml.kernel.org/r/20210716063933.31633-1-rppt@kernel.org
Fixes: 0af573780b0b ("mm: require ->set_page_dirty to be explicitly wired up")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roman Gushchin [Fri, 23 Jul 2021 22:50:32 +0000 (15:50 -0700)]
writeback, cgroup: do not reparent dax inodes
The inode switching code is not suited for dax inodes. An attempt to
switch a dax inode to a parent writeback structure (as a part of a
writeback cleanup procedure) results in a panic like this:
run fstests generic/270 at 2021-07-15 05:54:02
XFS (pmem0p2): EXPERIMENTAL big timestamp feature in use. Use at your own risk!
XFS (pmem0p2): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
XFS (pmem0p2): EXPERIMENTAL inode btree counters feature in use. Use at your own risk!
XFS (pmem0p2): Mounting V5 Filesystem
XFS (pmem0p2): Ending clean mount
XFS (pmem0p2): Quotacheck needed: Please wait.
XFS (pmem0p2): Quotacheck: Done.
XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
BUG: unable to handle page fault for address: 0000000005b0f669
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 13 PID: 10479 Comm: kworker/13:16 Not tainted 5.14.0-rc1-master-8096acd7442e+ #8
Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 09/13/2016
Workqueue: inode_switch_wbs inode_switch_wbs_work_fn
RIP: 0010:inode_do_switch_wbs+0xaf/0x470
Code: 00 30 0f 85 c1 03 00 00 0f 1f 44 00 00 31 d2 48 c7 c6 ff ff ff ff 48 8d 7c 24 08 e8 eb 49 1a 00 48 85 c0 74 4a bb ff ff ff ff <48> 8b 50 08 48 8d 4a ff 83 e2 01 48 0f 45 c1 48 8b 00 a8 08 0f 85
RSP: 0018:ffff9c66691abdc8 EFLAGS: 00010002
RAX: 0000000005b0f661 RBX: 00000000ffffffff RCX: ffff89e6a21382b0
RDX: 0000000000000001 RSI: ffff89e350230248 RDI: ffffffffffffffff
RBP: ffff89e681d19400 R08: 0000000000000000 R09: 0000000000000228
R10: ffffffffffffffff R11: ffffffffffffffc0 R12: ffff89e6a2138130
R13: ffff89e316af7400 R14: ffff89e316af6e78 R15: ffff89e6a21382b0
FS: 0000000000000000(0000) GS:ffff89ee5fb40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000005b0f669 CR3: 0000000cb2410004 CR4: 00000000001706e0
Call Trace:
inode_switch_wbs_work_fn+0xb6/0x2a0
process_one_work+0x1e6/0x380
worker_thread+0x53/0x3d0
kthread+0x10f/0x130
ret_from_fork+0x22/0x30
Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter nf_tables nfnetlink bridge stp llc rfkill sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm mgag200 i2c_algo_bit iTCO_wdt irqbypass drm_kms_helper iTCO_vendor_support acpi_ipmi rapl syscopyarea sysfillrect intel_cstate ipmi_si sysimgblt ioatdma dax_pmem_compat fb_sys_fops ipmi_devintf device_dax i2c_i801 pcspkr intel_uncore hpilo nd_pmem cec dax_pmem_core dca i2c_smbus acpi_tad lpc_ich ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sd_mod t10_pi crct10dif_pclmul crc32_pclmul crc32c_intel tg3 ghash_clmulni_intel serio_raw hpsa hpwdt scsi_transport_sas wmi dm_mirror dm_region_hash dm_log dm_mod
CR2: 0000000005b0f669
---[ end trace ed2105faff8384f3 ]---
RIP: 0010:inode_do_switch_wbs+0xaf/0x470
Code: 00 30 0f 85 c1 03 00 00 0f 1f 44 00 00 31 d2 48 c7 c6 ff ff ff ff 48 8d 7c 24 08 e8 eb 49 1a 00 48 85 c0 74 4a bb ff ff ff ff <48> 8b 50 08 48 8d 4a ff 83 e2 01 48 0f 45 c1 48 8b 00 a8 08 0f 85
RSP: 0018:ffff9c66691abdc8 EFLAGS: 00010002
RAX: 0000000005b0f661 RBX: 00000000ffffffff RCX: ffff89e6a21382b0
RDX: 0000000000000001 RSI: ffff89e350230248 RDI: ffffffffffffffff
RBP: ffff89e681d19400 R08: 0000000000000000 R09: 0000000000000228
R10: ffffffffffffffff R11: ffffffffffffffc0 R12: ffff89e6a2138130
R13: ffff89e316af7400 R14: ffff89e316af6e78 R15: ffff89e6a21382b0
FS: 0000000000000000(0000) GS:ffff89ee5fb40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000005b0f669 CR3: 0000000cb2410004 CR4: 00000000001706e0
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x15200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: Fatal exception ]---
The crash happens on an attempt to iterate over attached pagecache pages
and check the dirty flag: a dax inode's xarray contains pfn's instead of
generic struct page pointers.
This happens for DAX and not for other kinds of non-page entries in the
inodes because it's a tagged iteration, and shadow/swap entries are
never tagged; only DAX entries get tagged.
Fix the problem by bailing out (with the false return value) of
inode_prepare_wbs_switch() if a dax inode is passed.
[willy@infradead.org: changelog addition]
Link: https://lkml.kernel.org/r/20210719171350.3876830-1-guro@fb.com
Fixes: c22d70a162d3 ("writeback, cgroup: release dying cgwbs by switching attached inodes")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Reported-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Murphy Zhou <jencce.kernel@gmail.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roman Gushchin [Fri, 23 Jul 2021 22:50:29 +0000 (15:50 -0700)]
writeback, cgroup: remove wb from offline list before releasing refcnt
Boyang reported that the commit c22d70a162d3 ("writeback, cgroup:
release dying cgwbs by switching attached inodes") causes the kernel to
crash while running xfstests generic/256 on ext4 on aarch64 and ppc64le.
run fstests generic/256 at 2021-07-12 05:41:40
EXT4-fs (vda3): mounted filesystem with ordered data mode. Opts: . Quota mode: none.
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
Mem abort info:
ESR = 0x96000005
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x05: level 1 translation fault
Data abort info:
ISV = 0, ISS = 0x00000005
CM = 0, WnR = 0
user pgtable: 64k pages, 48-bit VAs, pgdp=00000000b0502000
[0000000000000000] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
Internal error: Oops: 96000005 [#1] SMP
Modules linked in: dm_flakey dm_snapshot dm_bufio dm_zero dm_mod loop tls rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc ext4 vfat fat mbcache jbd2 drm fuse xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_blk virtio_net net_failover virtio_console failover virtio_mmio aes_neon_bs [last unloaded: scsi_debug]
CPU: 0 PID: 408468 Comm: kworker/u8:5 Tainted: G X --------- --- 5.14.0-0.rc1.15.bx.el9.aarch64 #1
Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
Workqueue: events_unbound cleanup_offline_cgwbs_workfn
pstate: 004000c5 (nzcv daIF +PAN -UAO -TCO BTYPE=--)
pc : cleanup_offline_cgwbs_workfn+0x320/0x394
lr : cleanup_offline_cgwbs_workfn+0xe0/0x394
sp : ffff80001554fd10
x29: ffff80001554fd10 x28: 0000000000000000 x27: 0000000000000001
x26: 0000000000000000 x25: 00000000000000e0 x24: ffffd2a2fbe671a8
x23: ffff80001554fd88 x22: ffffd2a2fbe67198 x21: ffffd2a2fc25a730
x20: ffff210412bc3000 x19: ffff210412bc3280 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
x14: 0000000000000000 x13: 0000000000000030 x12: 0000000000000040
x11: ffff210481572238 x10: ffff21048157223a x9 : ffffd2a2fa276c60
x8 : ffff210484106b60 x7 : 0000000000000000 x6 : 000000000007d18a
x5 : ffff210416a86400 x4 : ffff210412bc0280 x3 : 0000000000000000
x2 : ffff80001554fd88 x1 : ffff210412bc0280 x0 : 0000000000000003
Call trace:
cleanup_offline_cgwbs_workfn+0x320/0x394
process_one_work+0x1f4/0x4b0
worker_thread+0x184/0x540
kthread+0x114/0x120
ret_from_fork+0x10/0x18
Code: d63f0020 97f99963 17ffffa6 f8588263 (f9400061)
---[ end trace e250fe289272792a ]---
Kernel panic - not syncing: Oops: Fatal exception
SMP: stopping secondary CPUs
SMP: failed to stop secondary CPUs 0-2
Kernel Offset: 0x52a2e9fa0000 from 0xffff800010000000
PHYS_OFFSET: 0xfff0defca0000000
CPU features: 0x00200251,23200840
Memory Limit: none
---[ end Kernel panic - not syncing: Oops: Fatal exception ]---
The problem happens when cgwb_release_workfn() races with
cleanup_offline_cgwbs_workfn(): wb_tryget() in
cleanup_offline_cgwbs_workfn() can be called after percpu_ref_exit() in
cgwb_release_workfn(), which is basically a use-after-free error.
Fix the problem by removing the writeback structure from the offline
list before releasing the percpu reference counter. This guarantees
that cleanup_offline_cgwbs_workfn() will neither see nor access
writeback structures which are about to be released.
Link: https://lkml.kernel.org/r/20210716201039.3762203-1-guro@fb.com
Fixes: c22d70a162d3 ("writeback, cgroup: release dying cgwbs by switching attached inodes")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Boyang Xue <bxue@redhat.com>
Suggested-by: Jan Kara <jack@suse.cz>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Murphy Zhou <jencce.kernel@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Fri, 23 Jul 2021 22:50:26 +0000 (15:50 -0700)]
memblock: make for_each_mem_range() traverse MEMBLOCK_HOTPLUG regions
Commit b10d6bca8720 ("arch, drivers: replace for_each_membock() with
for_each_mem_range()") didn't take into account that when there is
movable_node parameter in the kernel command line, for_each_mem_range()
would skip ranges marked with MEMBLOCK_HOTPLUG.
The page table setup code in POWER uses for_each_mem_range() to create
the linear mapping of the physical memory and since the regions marked
as MEMBLOCK_HOTPLUG are skipped, they never make it to the linear map.
A later access to the memory in those ranges will fail:
BUG: Unable to handle kernel data access on write at 0xc000000400000000
Faulting instruction address: 0xc00000000008a3c0
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 0 PID: 53 Comm: kworker/u2:0 Not tainted 5.13.0 #7
NIP: c00000000008a3c0 LR: c0000000003c1ed8 CTR: 0000000000000040
REGS: c000000008a57770 TRAP: 0300 Not tainted (5.13.0)
MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 84222202 XER: 20040000
CFAR: c0000000003c1ed4 DAR: c000000400000000 DSISR: 42000000 IRQMASK: 0
GPR00: c0000000003c1ed8 c000000008a57a10 c0000000019da700 c000000400000000
GPR04: 0000000000000280 0000000000000180 0000000000000400 0000000000000200
GPR08: 0000000000000100 0000000000000080 0000000000000040 0000000000000300
GPR12: 0000000000000380 c000000001bc0000 c0000000001660c8 c000000006337e00
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000040000000 0000000020000000 c000000001a81990 c000000008c30000
GPR24: c000000008c20000 c000000001a81998 000fffffffff0000 c000000001a819a0
GPR28: c000000001a81908 c00c000001000000 c000000008c40000 c000000008a64680
NIP clear_user_page+0x50/0x80
LR __handle_mm_fault+0xc88/0x1910
Call Trace:
__handle_mm_fault+0xc44/0x1910 (unreliable)
handle_mm_fault+0x130/0x2a0
__get_user_pages+0x248/0x610
__get_user_pages_remote+0x12c/0x3e0
get_arg_page+0x54/0xf0
copy_string_kernel+0x11c/0x210
kernel_execve+0x16c/0x220
call_usermodehelper_exec_async+0x1b0/0x2f0
ret_from_kernel_thread+0x5c/0x70
Instruction dump:
79280fa4 79271764 79261f24 794ae8e2 7ca94214 7d683a14 7c893a14 7d893050
7d4903a6 60000000 60000000 60000000 <7c001fec> 7c091fec 7c081fec 7c051fec
---[ end trace 490b8c67e6075e09 ]---
Making for_each_mem_range() include MEMBLOCK_HOTPLUG regions in the
traversal fixes this issue.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1976100
Link: https://lkml.kernel.org/r/20210712071132.20902-1-rppt@kernel.org
Fixes: b10d6bca8720 ("arch, drivers: replace for_each_membock() with for_each_mem_range()")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Tested-by: Greg Kurz <groug@kaod.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org> [5.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sergei Trofimovich [Fri, 23 Jul 2021 22:50:23 +0000 (15:50 -0700)]
mm: page_alloc: fix page_poison=1 / INIT_ON_ALLOC_DEFAULT_ON interaction
To reproduce the failure we need the following system:
- kernel command: page_poison=1 init_on_free=0 init_on_alloc=0
- kernel config:
* CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y
* CONFIG_INIT_ON_FREE_DEFAULT_ON=y
* CONFIG_PAGE_POISONING=y
Resulting in:
0000000085629bdd: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0000000022861832: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000000c597f5b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
CPU: 11 PID: 15195 Comm: bash Kdump: loaded Tainted: G U O 5.13.1-gentoo-x86_64 #1
Hardware name: System manufacturer System Product Name/PRIME Z370-A, BIOS 2801 01/13/2021
Call Trace:
dump_stack+0x64/0x7c
__kernel_unpoison_pages.cold+0x48/0x84
post_alloc_hook+0x60/0xa0
get_page_from_freelist+0xdb8/0x1000
__alloc_pages+0x163/0x2b0
__get_free_pages+0xc/0x30
pgd_alloc+0x2e/0x1a0
mm_init+0x185/0x270
dup_mm+0x6b/0x4f0
copy_process+0x190d/0x1b10
kernel_clone+0xba/0x3b0
__do_sys_clone+0x8f/0xb0
do_syscall_64+0x68/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Before commit 51cba1ebc60d ("init_on_alloc: Optimize static branches")
init_on_alloc never enabled the static branch by default. It could only
be enabled explicitly by init_mem_debugging_and_hardening().
But after commit 51cba1ebc60d, a static branch could already be enabled
by default. There was no code to ever disable it. That caused the
page_poison=1 / init_on_free=1 conflict.
This change extends init_mem_debugging_and_hardening() to also handle
static branch disabling.
Link: https://lkml.kernel.org/r/20210714031935.4094114-1-keescook@chromium.org
Link: https://lore.kernel.org/r/20210712215816.1512739-1-slyfox@gentoo.org
Fixes: 51cba1ebc60d ("init_on_alloc: Optimize static branches")
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Co-developed-by: Kees Cook <keescook@chromium.org>
Reported-by: Mikhail Morfikov <mmorfikov@gmail.com>
Reported-by: <bowsingbetee@pm.me>
Tested-by: <bowsingbetee@protonmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Christoph Hellwig [Fri, 23 Jul 2021 22:50:20 +0000 (15:50 -0700)]
mm: use kmap_local_page in memzero_page
The commit introducing the global memzero_page explicitly mentions
switching to kmap_local_page in its commit log but doesn't actually
do that.
Link: https://lkml.kernel.org/r/20210713055231.137602-3-hch@lst.de
Fixes: 28961998f858 ("iov_iter: lift memzero_page() to highmem.h")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Christoph Hellwig [Fri, 23 Jul 2021 22:50:17 +0000 (15:50 -0700)]
mm: call flush_dcache_page() in memcpy_to_page() and memzero_page()
memcpy_to_page and memzero_page can write to arbitrary pages, which
could be in the page cache or in high memory, so call
flush_dcache_page() to flush the dcache.
This is a problem when using these helpers on dcache challenged
architectures. Right now there are just a few users, chances are no one
used the PC floppy driver, the aha1542 driver for an ISA SCSI HBA, and a
few advanced and optional btrfs and ext4 features on those platforms yet
since the conversion.
Link: https://lkml.kernel.org/r/20210713055231.137602-2-hch@lst.de
Fixes: bb90d4bc7b6a ("mm/highmem: Lift memcpy_[to|from]_page to core")
Fixes: 28961998f858 ("iov_iter: lift memzero_page() to highmem.h")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexander Potapenko [Fri, 23 Jul 2021 22:50:14 +0000 (15:50 -0700)]
kfence: skip all GFP_ZONEMASK allocations
Allocation requests outside ZONE_NORMAL (MOVABLE, HIGHMEM or DMA) cannot
be fulfilled by KFENCE, because KFENCE memory pool is located in a zone
different from the requested one.
Because callers of kmem_cache_alloc() may actually rely on the
allocation to reside in the requested zone (e.g. memory allocations
done with __GFP_DMA must be DMAable), skip all allocations done with
GFP_ZONEMASK and/or respective SLAB flags (SLAB_CACHE_DMA and
SLAB_CACHE_DMA32).
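The skip rule described above can be sketched as a small userspace check.
This is only an illustration: the SKETCH_* bit values and the function name
are hypothetical stand-ins, not the kernel's real flag definitions.

```c
#include <stdbool.h>

/* Hypothetical bit values for illustration; the kernel's real
 * __GFP_* and SLAB_* definitions differ. */
#define SKETCH_GFP_DMA          0x01u
#define SKETCH_GFP_HIGHMEM      0x02u
#define SKETCH_GFP_DMA32        0x04u
#define SKETCH_GFP_MOVABLE      0x08u
#define SKETCH_GFP_ZONEMASK     (SKETCH_GFP_DMA | SKETCH_GFP_HIGHMEM | \
                                 SKETCH_GFP_DMA32 | SKETCH_GFP_MOVABLE)

#define SKETCH_SLAB_CACHE_DMA   0x10u
#define SKETCH_SLAB_CACHE_DMA32 0x20u

/* Returns true only if neither the request flags nor the cache flags
 * ask for memory from a specific zone. */
static bool sketch_kfence_may_serve(unsigned int gfp_flags,
                                    unsigned int cache_flags)
{
    if (gfp_flags & SKETCH_GFP_ZONEMASK)
        return false;   /* caller asked for a specific zone */
    if (cache_flags & (SKETCH_SLAB_CACHE_DMA | SKETCH_SLAB_CACHE_DMA32))
        return false;   /* the cache itself is zone-restricted */
    return true;
}
```

A request with no zone bits set is eligible; any zone bit on either the
request or the cache makes the sketch decline it.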
Link: https://lkml.kernel.org/r/20210714092222.1890268-2-glider@google.com
Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure")
Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Acked-by: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: <stable@vger.kernel.org> [5.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexander Potapenko [Fri, 23 Jul 2021 22:50:11 +0000 (15:50 -0700)]
kfence: move the size check to the beginning of __kfence_alloc()
Check the allocation size before toggling kfence_allocation_gate.
This way allocations that can't be served by KFENCE will not result in
waiting for another CONFIG_KFENCE_SAMPLE_INTERVAL without allocating
anything.
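The effect of the reordering can be sketched in userspace as follows. This
is a simplified model, not the kernel code: the plain int gate, the
SKETCH_PAGE_SIZE limit and the function name are assumptions made for
illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed upper bound for served sizes */

/* 0 means the gate is open; nonzero means a sample was already taken
 * and the next one has to wait for the following interval. */
static int sketch_gate;

/* Returns true if the sampled pool would serve this allocation. */
static bool sketch_alloc(size_t size)
{
    /* Size check first: an oversized request leaves the gate alone. */
    if (size > SKETCH_PAGE_SIZE)
        return false;

    /* Only a request that can be served closes the gate. */
    if (sketch_gate != 0)
        return false;
    sketch_gate = 1;
    return true;
}
```

With the size check placed after the gate update instead, an oversized
request would close the gate and waste the whole sample interval.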
Link: https://lkml.kernel.org/r/20210714092222.1890268-1-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Suggested-by: Marco Elver <elver@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org> [5.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Weizhao Ouyang [Fri, 23 Jul 2021 22:50:08 +0000 (15:50 -0700)]
kfence: defer kfence_test_init to ensure that kunit debugfs is created
kfence_test_init and kunit_init both use the same late_initcall level, so
if kfence_test_init is linked ahead of kunit_init, kfence_test_init gets a
NULL debugfs_rootdir as its parent dentry. kfence_test_init and
kfence_debugfs_init then both create a debugfs node named "kfence" under
debugfs_mount->mnt_root, which fails with EEXIST and prints
"debugfs: Directory 'kfence' with parent '/' already present!".
So kfence_test_init should be deferred.
Link: https://lkml.kernel.org/r/20210714113140.2949995-1-o451686892@gmail.com
Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
Tested-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Collingbourne [Fri, 23 Jul 2021 22:50:04 +0000 (15:50 -0700)]
selftest: use mmap instead of posix_memalign to allocate memory
This test passes pointers obtained from anon_allocate_area to the
userfaultfd and mremap APIs. This causes a problem if the system
allocator returns tagged pointers because with the tagged address ABI
the kernel rejects tagged addresses passed to these APIs, which would
end up causing the test to fail. To make this test compatible with such
system allocators, stop using the system allocator to allocate memory in
anon_allocate_area, and instead just use mmap.
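A minimal sketch of allocating an anonymous area directly with mmap, so
that the pointer never passes through the system allocator. The helper
names are hypothetical and this is not the selftest's actual code.

```c
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS under strict -std= modes */
#include <stddef.h>
#include <sys/mman.h>

/* Map len bytes of private anonymous memory; returns NULL on failure. */
static void *sketch_alloc_area(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Unmap an area obtained from sketch_alloc_area(); returns 0 on success. */
static int sketch_free_area(void *p, size_t len)
{
    return munmap(p, len);
}
```

Because the address comes straight from the kernel, it is page-aligned
and is not modified by a tagging allocator before the test uses it.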
Link: https://lkml.kernel.org/r/20210714195437.118982-3-pcc@google.com
Link: https://linux-review.googlesource.com/id/Icac91064fcd923f77a83e8e133f8631c5b8fc241
Fixes: c47174fc362a ("userfaultfd: selftest")
Co-developed-by: Lokesh Gidra <lokeshgidra@google.com>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Alistair Delva <adelva@google.com>
Cc: William McVicker <willmcvicker@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mitch Phillips <mitchp@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: <stable@vger.kernel.org> [5.4]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Collingbourne [Fri, 23 Jul 2021 22:50:01 +0000 (15:50 -0700)]
userfaultfd: do not untag user pointers
Patch series "userfaultfd: do not untag user pointers", v5.
If a user program uses userfaultfd on ranges of heap memory, it may end
up passing a tagged pointer to the kernel in the range.start field of
the UFFDIO_REGISTER ioctl. This can happen when using an MTE-capable
allocator, or on Android if using the Tagged Pointers feature for MTE
readiness [1].
When a fault subsequently occurs, the tag is stripped from the fault
address returned to the application in the fault.address field of struct
uffd_msg. However, from the application's perspective, the tagged
address *is* the memory address, so if the application is unaware of
memory tags, it may get confused by receiving an address that is, from
its point of view, outside of the bounds of the allocation. We observed
this behavior in the kselftest for userfaultfd [2] but other
applications could have the same problem.
Address this by not untagging pointers passed to the userfaultfd ioctls.
Instead, let the system call fail. Also change the kselftest to use
mmap so that it doesn't encounter this problem.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
This patch (of 2):
Do not untag pointers passed to the userfaultfd ioctls. Instead, let
the system call fail. This will provide an early indication of problems
with tag-unaware userspace code instead of letting the code get confused
later, and is consistent with how we decided to handle brk/mmap/mremap
in commit dcde237319e6 ("mm: Avoid creating virtual address aliases in
brk()/mmap()/mremap()"), as well as being consistent with the existing
tagged address ABI documentation relating to how ioctl arguments are
handled.
The code change is a revert of commit 7d0325749a6c ("userfaultfd: untag
user pointers") plus some fixups to some additional calls to
validate_range that have appeared since then.
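The "fail instead of untag" behaviour can be illustrated with a userspace
sketch. It assumes the tag occupies the top byte of a 64-bit address, as
with arm64 top-byte-ignore; the function name and the single check are
hypothetical and much simpler than the kernel's validate_range.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_TAG_SHIFT 56     /* assumed: tag lives in bits 63:56 */

/* Reject an address that still carries a tag instead of silently
 * stripping it; returns 0 if the address is untagged. */
static int sketch_validate_start(uint64_t start)
{
    if ((start >> SKETCH_TAG_SHIFT) != 0)
        return -EINVAL;
    return 0;
}
```

A caller that passes a tagged heap pointer gets an error right away,
rather than later receiving fault addresses that do not match its own
view of the allocation.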
Link: https://lkml.kernel.org/r/20210714195437.118982-1-pcc@google.com
Link: https://lkml.kernel.org/r/20210714195437.118982-2-pcc@google.com
Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0a25501b
Fixes: 63f0c6037965 ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Delva <adelva@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mitch Phillips <mitchp@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: William McVicker <willmcvicker@google.com>
Cc: <stable@vger.kernel.org> [5.4]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jisheng Zhang [Fri, 23 Jul 2021 00:22:26 +0000 (08:22 +0800)]
riscv: stacktrace: pin the task's stack in get_wchan
Pin the task's stack before calling walk_stackframe() in get_wchan().
This can fix the panic as reported by Andreas when CONFIG_VMAP_STACK=y:
[ 65.609696] Unable to handle kernel paging request at virtual address ffffffd0003bbde8
[ 65.610460] Oops [#1]
[ 65.610626] Modules linked in: virtio_blk virtio_mmio rtc_goldfish btrfs blake2b_generic libcrc32c xor raid6_pq sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
[ 65.611670] CPU: 2 PID: 1 Comm: systemd Not tainted 5.14.0-rc1-1.g34fe32a-default #1 openSUSE Tumbleweed (unreleased) c62f7109153e5a0897ee58ba52393ad99b070fd2
[ 65.612334] Hardware name: riscv-virtio,qemu (DT)
[ 65.613008] epc : get_wchan+0x5c/0x88
[ 65.613334] ra : get_wchan+0x42/0x88
[ 65.613625] epc : ffffffff800048a4 ra : ffffffff8000488a sp : ffffffd00021bb90
[ 65.614008] gp : ffffffff817709f8 tp : ffffffe07fe91b80 t0 : 00000000000001f8
[ 65.614411] t1 : 0000000000020000 t2 : 0000000000000000 s0 : ffffffd00021bbd0
[ 65.614818] s1 : ffffffd0003bbdf0 a0 : 0000000000000001 a1 : 0000000000000002
[ 65.615237] a2 : ffffffff81618008 a3 : 0000000000000000 a4 : 0000000000000000
[ 65.615637] a5 : ffffffd0003bc000 a6 : 0000000000000002 a7 : ffffffe27d370000
[ 65.616022] s2 : ffffffd0003bbd90 s3 : ffffffff8071a81e s4 : 0000000000003fff
[ 65.616407] s5 : ffffffffffffc000 s6 : 0000000000000000 s7 : ffffffff81618008
[ 65.616845] s8 : 0000000000000001 s9 : 0000000180000040 s10: 0000000000000000
[ 65.617248] s11: 000000000000016b t3 : 000000ff00000000 t4 : 0c6aec92de5e3fd7
[ 65.617672] t5 : fff78f60608fcfff t6 : 0000000000000078
[ 65.618088] status: 0000000000000120 badaddr: ffffffd0003bbde8 cause: 000000000000000d
[ 65.618621] [<ffffffff800048a4>] get_wchan+0x5c/0x88
[ 65.619008] [<ffffffff8022da88>] do_task_stat+0x7a2/0xa46
[ 65.619325] [<ffffffff8022e87e>] proc_tgid_stat+0xe/0x16
[ 65.619637] [<ffffffff80227dd6>] proc_single_show+0x46/0x96
[ 65.619979] [<ffffffff801ccb1e>] seq_read_iter+0x190/0x31e
[ 65.620341] [<ffffffff801ccd70>] seq_read+0xc4/0x104
[ 65.620633] [<ffffffff801a6bfe>] vfs_read+0x6a/0x112
[ 65.620922] [<ffffffff801a701c>] ksys_read+0x54/0xbe
[ 65.621206] [<ffffffff801a7094>] sys_read+0xe/0x16
[ 65.621474] [<ffffffff8000303e>] ret_from_syscall+0x0/0x2
[ 65.622169] ---[ end trace f24856ed2b8789c5 ]---
[ 65.622832] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Jens Axboe [Fri, 23 Jul 2021 17:53:54 +0000 (11:53 -0600)]
io_uring: explicitly catch any illegal async queue attempt
Catch an illegal case to queue async from an unrelated task that got
the ring fd passed to it. This should not be possible to hit, but
better be proactive and catch it explicitly. io-wq is extended to
check for early IO_WQ_WORK_CANCEL being set on a work item as well,
so it can run the request through the normal cancelation path.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 23 Jul 2021 17:49:29 +0000 (11:49 -0600)]
io_uring: never attempt iopoll reissue from release path
There are two reasons why this shouldn't be done:
1) Ring is exiting, and we're canceling requests anyway. Any request
should be canceled anyway. In theory, this could iterate for a
number of times if someone else is also driving the target block
queue into request starvation, however the likelihood of this
happening is minuscule.
2) If the original task decided to pass the ring to another task, then
we don't want to be reissuing from this context as it may be an
unrelated task or context. No assumptions should be made about
the context in which ->release() is run. This can only happen for pure
read/write, and we'll get -EFAULT on them anyway.
Link: https://lore.kernel.org/io-uring/YPr4OaHv0iv0KTOc@zeniv-ca.linux.org.uk/
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Fri, 23 Jul 2021 19:49:07 +0000 (12:49 -0700)]
Merge tag 'for-5.14-rc2-tag' of git://git./linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few fixes and one patch to help some block layer API cleanups:
- skip missing device when running fstrim
- fix unpersisted i_size on fsync after expanding truncate
- fix lock inversion problem when doing qgroup extent tracing
- replace bdgrab/bdput usage, replace gendisk by block_device"
* tag 'for-5.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: store a block_device in struct btrfs_ordered_extent
btrfs: fix lock inversion problem when doing qgroup extent tracing
btrfs: check for missing device in btrfs_trim_fs
btrfs: fix unpersisted i_size on fsync after expanding truncate
Linus Torvalds [Fri, 23 Jul 2021 18:30:12 +0000 (11:30 -0700)]
Merge tag 'ceph-for-5.14-rc3' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A subtle deadlock on lock_rwsem (marked for stable) and rbd fixes for
a -rc1 regression.
Also included a rare WARN condition tweak"
* tag 'ceph-for-5.14-rc3' of git://github.com/ceph/ceph-client:
rbd: resurrect setting of disk->private_data in rbd_init_disk()
ceph: don't WARN if we're still opening a session to an MDS
rbd: don't hold lock_rwsem while running_list is being drained
rbd: always kick acquire on "acquired" and "released" notifications
Linus Torvalds [Fri, 23 Jul 2021 18:25:21 +0000 (11:25 -0700)]
Merge tag 'trace-v5.14-rc2' of git://git./linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix deadloop in ring buffer because of using stale "read" variable
- Fix synthetic event use of field_pos as boolean and not an index
- Fixed histogram special var "cpu" overriding event fields called
"cpu"
- Cleaned up error prone logic in alloc_synth_event()
- Removed call to synchronize_rcu_tasks_rude() when not needed
- Removed redundant initialization of a local variable "ret"
- Fixed kernel crash when updating tracepoint callbacks of different
priorities.
* tag 'trace-v5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracepoints: Update static_call before tp_funcs when adding a tracepoint
ftrace: Remove redundant initialization of variable ret
ftrace: Avoid synchronize_rcu_tasks_rude() call when not necessary
tracing: Clean up alloc_synth_event()
tracing/histogram: Rename "cpu" to "common_cpu"
tracing: Synthetic event field_pos is an index not a boolean
tracing: Fix bug in rb_per_cpu_empty() that might cause deadloop.