Christoph Hellwig [Mon, 7 Sep 2020 05:58:18 +0000 (07:58 +0200)]
uaccess: provide a generic TASK_SIZE_MAX definition
Define TASK_SIZE_MAX as TASK_SIZE if not otherwise defined.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Palmer Dabbelt [Sun, 4 Oct 2020 17:14:53 +0000 (10:14 -0700)]
Merge branch 'base.set_fs' of git://git./linux/kernel/git/viro/vfs into for-next
This is a dependency for Christoph's removal of set_fs.
* 'base.set_fs' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
powerpc: remove address space overrides using set_fs()
powerpc: use non-set_fs based maccess routines
x86: remove address space overrides using set_fs()
x86: make TASK_SIZE_MAX usable from assembly code
x86: move PAGE_OFFSET, TASK_SIZE & friends to page_{32,64}_types.h
lkdtm: remove set_fs-based tests
test_bitmap: remove user bitmap tests
uaccess: add infrastructure for kernel builds with set_fs()
fs: don't allow splice read/write without explicit ops
fs: don't allow kernel reads and writes without iter ops
sysctl: Convert to iter interfaces
proc: add a read_iter method to proc proc_ops
proc: cleanup the compat vs no compat file ops
proc: remove a level of indentation in proc_get_inode
Atish Patra [Thu, 17 Sep 2020 22:37:16 +0000 (15:37 -0700)]
RISC-V: Add page table dump support for uefi
Extend the current page table dump support in RISC-V to include efi
pages as well.
Here is the output of efi runtime page table mappings.
---[ UEFI runtime start ]---
0x0000000020002000-0x0000000020003000 0x00000000be732000 4K PTE D A . . . W R V
0x0000000020018000-0x0000000020019000 0x00000000be738000 4K PTE D A . . . W R V
0x000000002002c000-0x000000002002d000 0x00000000be73c000 4K PTE D A . . . W R V
0x0000000020031000-0x0000000020032000 0x00000000bff61000 4K PTE D A . . X W R V
---[ UEFI runtime end ]---
Signed-off-by: Atish Patra <atish.patra@wdc.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Atish Patra [Thu, 17 Sep 2020 22:37:15 +0000 (15:37 -0700)]
RISC-V: Add EFI runtime services
This patch adds EFI runtime service support for RISC-V.
Signed-off-by: Atish Patra <atish.patra@wdc.com>
[ardb: - Remove the page check]
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Atish Patra [Thu, 17 Sep 2020 22:37:14 +0000 (15:37 -0700)]
RISC-V: Add EFI stub support.
Add a RISC-V architecture specific stub code that actually copies the
actual kernel image to a valid address and jump to it after boot services
are terminated. Enable UEFI related kernel configs as well for RISC-V.
Signed-off-by: Atish Patra <atish.patra@wdc.com>
Link: https://lore.kernel.org/r/20200421033336.9663-4-atish.patra@wdc.com
[ardb: - move hartid fetch into check_platform_features()
- use image_size not reserve_size
- select ISA_C
- do not use dram_base]
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Atish Patra [Thu, 17 Sep 2020 22:37:13 +0000 (15:37 -0700)]
RISC-V: Add PE/COFF header for EFI stub
Linux kernel Image can appear as an EFI application With appropriate
PE/COFF header fields in the beginning of the Image header. An EFI
application loader can directly load a Linux kernel Image and an EFI
stub residing in kernel can boot Linux kernel directly.
Add the necessary PE/COFF header.
Signed-off-by: Atish Patra <atish.patra@wdc.com>
Link: https://lore.kernel.org/r/20200421033336.9663-3-atish.patra@wdc.com
[ardb: - use C prefix for c.li to ensure the expected opcode is emitted
- align all image sections according to PE/COFF section alignment ]
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Atish Patra [Thu, 17 Sep 2020 22:37:12 +0000 (15:37 -0700)]
RISC-V: Implement late mapping page table allocation functions
Currently, page table setup is done during setup_va_final where fixmap can
be used to create the temporary mappings. The physical frame is allocated
from memblock_alloc_* functions. However, this won't work if page table
mapping needs to be created for a different mm context (i.e. efi mm) at
a later point of time.
Use generic kernel page allocation function & macros for any mapping
after setup_vm_final.
Signed-off-by: Atish Patra <atish.patra@wdc.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Atish Patra [Thu, 17 Sep 2020 22:37:11 +0000 (15:37 -0700)]
RISC-V: Add early ioremap support
UEFI uses early IO or memory mappings for runtime services before
normal ioremap() is usable. Add the necessary fixmap bindings and
pmd mappings for generic ioremap support to work.
Signed-off-by: Atish Patra <atish.patra@wdc.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Anup Patel [Thu, 17 Sep 2020 22:37:10 +0000 (15:37 -0700)]
RISC-V: Move DT mapping outof fixmap
Currently, RISC-V reserves 1MB of fixmap memory for device tree. However,
it maps only single PMD (2MB) space for fixmap which leaves only < 1MB space
left for other kernel features such as early ioremap which requires fixmap
as well. The fixmap size can be increased by another 2MB but it brings
additional complexity and changes the virtual memory layout as well.
If we require some additional feature requiring fixmap again, it has to be
moved again.
Technically, DT doesn't need a fixmap as the memory occupied by the DT is
only used during boot. That's why, We map device tree in early page table
using two consecutive PGD mappings at lower addresses (< PAGE_OFFSET).
This frees lot of space in fixmap and also makes maximum supported
device tree size supported as PGDIR_SIZE. Thus, init memory section can be used
for the same purpose as well. This simplifies fixmap implementation.
Signed-off-by: Anup Patel <anup.patel@wdc.com>
Signed-off-by: Atish Patra <atish.patra@wdc.com>
Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Palmer Dabbelt [Fri, 2 Oct 2020 21:29:51 +0000 (14:29 -0700)]
Merge tag 'efi-riscv-shared-for-v5.10' of ssh://gitolite./linux/kernel/git/efi/efi into for-next
Stable branch for v5.10 shared between the EFI and RISC-V trees
The RISC-V EFI boot and runtime support will be merged for v5.10 via
the RISC-V tree. However, it incorporates some changes that conflict
with other EFI changes that are in flight, so this tag serves as a
shared base that allows those conflicts to be resolved beforehand.
* tag 'efi-riscv-shared-for-v5.10' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi/libstub: arm32: Use low allocation for the uncompressed kernel
efi/libstub: Export efi_low_alloc_above() to other units
efi/libstub: arm32: Base FDT and initrd placement on image address
efi: Rename arm-init to efi-init common for all arch
include: pe.h: Add RISC-V related PE definition
Ard Biesheuvel [Wed, 9 Sep 2020 14:11:50 +0000 (17:11 +0300)]
efi/libstub: arm32: Use low allocation for the uncompressed kernel
Before commit
d0f9ca9be11f25ef ("ARM: decompressor: run decompressor in place if loaded via UEFI")
we were rather limited in the choice of base address for the uncompressed
kernel, as we were relying on the logic in the decompressor that blindly
rounds down the decompressor execution address to the next multiple of 128
MiB, and decompresses the kernel there. For this reason, we have a lot of
complicated memory region handling code, to ensure that this memory window
is available, even though it could be occupied by reserved regions or
other allocations that may or may not collide with the uncompressed image.
Today, we simply pass the target address for the decompressed image to the
decompressor directly, and so we can choose a suitable window just by
finding a 16 MiB aligned region, while taking TEXT_OFFSET and the region
for the swapper page tables into account.
So let's get rid of the complicated logic, and instead, use the existing
bottom up allocation routine to allocate a suitable window as low as
possible, and carve out a memory region that has the right properties.
Note that this removes any dependencies on the 'dram_base' argument to
handle_kernel_image(), and so this is removed as well. Given that this
was the only remaining use of dram_base, the code that produces it is
removed entirely as well.
Reviewed-by: Maxim Uvarov <maxim.uvarov@linaro.org>
Tested-by: Maxim Uvarov <maxim.uvarov@linaro.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Ard Biesheuvel [Wed, 9 Sep 2020 13:16:20 +0000 (16:16 +0300)]
efi/libstub: Export efi_low_alloc_above() to other units
Permit arm32-stub.c to access efi_low_alloc_above() in a subsequent
patch by giving it external linkage and declaring it in efistub.h.
Reviewed-by: Maxim Uvarov <maxim.uvarov@linaro.org>
Tested-by: Maxim Uvarov <maxim.uvarov@linaro.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Ard Biesheuvel [Thu, 10 Sep 2020 14:09:45 +0000 (17:09 +0300)]
efi/libstub: arm32: Base FDT and initrd placement on image address
The way we use the base of DRAM in the EFI stub is problematic as it
is ill defined what the base of DRAM actually means. There are some
restrictions on the placement of FDT and initrd which are defined in
terms of dram_base, but given that the placement of the kernel in
memory is what defines these boundaries (as on ARM, this is where the
linear region starts), it is better to use the image address in these
cases, and disregard dram_base altogether.
Reviewed-by: Maxim Uvarov <maxim.uvarov@linaro.org>
Tested-by: Maxim Uvarov <maxim.uvarov@linaro.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Tian Tao [Mon, 14 Sep 2020 00:52:02 +0000 (08:52 +0800)]
RISC-V: Fix duplicate included thread_info.h
asm/thread_info.h is included more than once, Remove the one that isn't
necessary.
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Pekka Enberg [Sat, 5 Sep 2020 06:07:26 +0000 (09:07 +0300)]
riscv/mm/fault: Set FAULT_FLAG_INSTRUCTION flag in do_page_fault()
If the page fault "cause" is EXC_INST_PAGE_FAULT, set the
FAULT_FLAG_INSTRUCTION flag to let handle_mm_fault() and friends know
about it. This has no functional changes because RISC-V uses the default
arch_vma_access_permitted() implementation, which always returns true.
However, dax_pmd_fault(), for example, has a tracepoint that uses
FAULT_FLAG_INSTRUCTION, so we might as well set it.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Pekka Enberg [Sat, 5 Sep 2020 05:52:52 +0000 (08:52 +0300)]
riscv/mm/fault: Fix inline placement in vmalloc_fault() declaration
The "inline" keyword is in the wrong place in vmalloc_fault()
declaration:
>> arch/riscv/mm/fault.c:56:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
56 | static void inline vmalloc_fault(struct pt_regs *regs, int code, unsigned long addr)
| ^~~~~~
Fix that up.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Zong Li [Mon, 31 Aug 2020 07:33:50 +0000 (15:33 +0800)]
riscv: Add cache information in AUX vector
There are no standard CSR registers to provide cache information, the
way for RISC-V is to get this information from DT. Currently, AT_L1I_X,
AT_L1D_X and AT_L2_X are present in glibc header, and sysconf syscall
could use them to get information of cache through AUX vector.
The result of 'getconf -a' as follows:
LEVEL1_ICACHE_SIZE 32768
LEVEL1_ICACHE_ASSOC 8
LEVEL1_ICACHE_LINESIZE 64
LEVEL1_DCACHE_SIZE 32768
LEVEL1_DCACHE_ASSOC 8
LEVEL1_DCACHE_LINESIZE 64
LEVEL2_CACHE_SIZE 2097152
LEVEL2_CACHE_ASSOC 32
LEVEL2_CACHE_LINESIZE 64
Signed-off-by: Zong Li <zong.li@sifive.com>
Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Zong Li [Mon, 31 Aug 2020 07:33:49 +0000 (15:33 +0800)]
riscv: Define AT_VECTOR_SIZE_ARCH for ARCH_DLINFO
AT_VECTOR_SIZE_ARCH should be defined with the maximum number of
NEW_AUX_ENT entries that ARCH_DLINFO can contain, but it wasn't defined
for RISC-V at all even though ARCH_DLINFO will contain one NEW_AUX_ENT
for the VDSO address.
Signed-off-by: Zong Li <zong.li@sifive.com>
Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Zong Li [Mon, 31 Aug 2020 07:33:48 +0000 (15:33 +0800)]
riscv: Set more data to cacheinfo
Set cacheinfo.{size,sets,line_size} for each cache node, then we can
get these information from userland through auxiliary vector.
Signed-off-by: Zong Li <zong.li@sifive.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Pekka Enberg [Tue, 25 Aug 2020 16:41:17 +0000 (19:41 +0300)]
riscv/mm/fault: Move access error check to function
Move the access error check into a access_error() function to simplify
the control flow in do_page_fault().
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Pekka Enberg [Tue, 25 Aug 2020 16:42:54 +0000 (19:42 +0300)]
riscv/mm/fault: Move FAULT_FLAG_WRITE handling in do_page_fault()
Let's handle the translation of EXC_STORE_PAGE_FAULT to FAULT_FLAG_WRITE
once before looking up the VMA. This makes it easier to extract access
error logic in the next patch.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Pekka Enberg [Tue, 25 Aug 2020 16:05:44 +0000 (19:05 +0300)]
riscv/mm/fault: Simplify mm_fault_error()
Simplify the mm_fault_error() handling function by eliminating the
unnecessary gotos.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Pekka Enberg [Tue, 25 Aug 2020 16:04:09 +0000 (19:04 +0300)]
riscv/mm/fault: Move fault error handling to mm_fault_error()
This patch moves the fault error handling to mm_fault_error() function
and converts gotos to calls to the new function.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Pekka Enberg [Tue, 25 Aug 2020 16:01:43 +0000 (19:01 +0300)]
riscv/mm/fault: Simplify fault error handling
Move fault error handling after retry logic. This simplifies the code
flow and makes it easier to move fault error handling to its own
function.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Pekka Enberg [Tue, 25 Aug 2020 15:54:26 +0000 (18:54 +0300)]
riscv/mm/fault: Move vmalloc fault handling to vmalloc_fault()
This patch moves the vmalloc fault handling in do_page_fault() to
vmalloc_fault() function and converts gotos to calls to the new
function.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Pekka Enberg [Tue, 25 Aug 2020 15:48:01 +0000 (18:48 +0300)]
riscv/mm/fault: Move bad area handling to bad_area()
This patch moves the bad area handling in do_page_fault() to bad_area()
function and converts gotos to calls to the new function.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Pekka Enberg [Tue, 25 Aug 2020 15:38:58 +0000 (18:38 +0300)]
riscv/mm/fault: Move no context handling to no_context()
This patch moves the no context handling in do_page_fault() to
no_context() function and converts gotos to calls to the new function.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Pekka Enberg [Wed, 19 Aug 2020 14:10:11 +0000 (17:10 +0300)]
riscv/mm: Simplify retry logic in do_page_fault()
Let's combine the two retry logic if statements in do_page_fault() to
simplify the code.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Atish Patra [Wed, 19 Aug 2020 22:24:23 +0000 (15:24 -0700)]
efi: Rename arm-init to efi-init common for all arch
arm-init is responsible for setting up efi runtime and doesn't actually
do any ARM specific stuff. RISC-V can use the same source code as it is.
Rename it to efi-init so that RISC-V can use it.
Signed-off-by: Atish Patra <atish.patra@wdc.com>
Link: https://lore.kernel.org/r/20200819222425.30721-8-atish.patra@wdc.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Atish Patra [Fri, 28 Aug 2020 17:20:31 +0000 (10:20 -0700)]
include: pe.h: Add RISC-V related PE definition
Define RISC-V related machine types.
Signed-off-by: Atish Patra <atish.patra@wdc.com>
Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
Link: https://lore.kernel.org/r/20200415195422.19866-3-atish.patra@wdc.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Christoph Hellwig [Thu, 3 Sep 2020 14:22:42 +0000 (16:22 +0200)]
powerpc: remove address space overrides using set_fs()
Stop providing the possibility to override the address space using
set_fs() now that there is no need for that any more.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Thu, 3 Sep 2020 14:22:41 +0000 (16:22 +0200)]
powerpc: use non-set_fs based maccess routines
Provide __get_kernel_nofault and __put_kernel_nofault routines to
implement the maccess routines without messing with set_fs and without
opening up access to user space.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Thu, 3 Sep 2020 14:22:40 +0000 (16:22 +0200)]
x86: remove address space overrides using set_fs()
Stop providing the possibility to override the address space using
set_fs() now that there is no need for that any more. To properly
handle the TASK_SIZE_MAX checking for 4 vs 5-level page tables on
x86 a new alternative is introduced, which just like the one in
entry_64.S has to use the hardcoded virtual address bits to escape
the fact that TASK_SIZE_MAX isn't actually a constant when 5-level
page tables are enabled.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Thu, 3 Sep 2020 14:22:39 +0000 (16:22 +0200)]
x86: make TASK_SIZE_MAX usable from assembly code
For 64-bit the only thing missing was a strategic _AC, and for 32-bit we
need to use __PAGE_OFFSET instead of PAGE_OFFSET in the TASK_SIZE
definition to escape the explicit unsigned long cast. This just works
because __PAGE_OFFSET is defined using _AC itself and thus never needs
the cast anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Thu, 3 Sep 2020 14:22:38 +0000 (16:22 +0200)]
x86: move PAGE_OFFSET, TASK_SIZE & friends to page_{32,64}_types.h
At least for 64-bit this moves them closer to some of the defines
they are based on, and it prepares for using the TASK_SIZE_MAX
definition from assembly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Thu, 3 Sep 2020 14:22:37 +0000 (16:22 +0200)]
lkdtm: remove set_fs-based tests
Once we can't manipulate the address limit, we also can't test what
happens when the manipulation is abused.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Thu, 3 Sep 2020 14:22:36 +0000 (16:22 +0200)]
test_bitmap: remove user bitmap tests
We can't run the tests for userspace bitmap parsing if set_fs() doesn't
exist, and it is about to go away for x86, powerpc with other major
architectures to follow.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Thu, 3 Sep 2020 14:22:35 +0000 (16:22 +0200)]
uaccess: add infrastructure for kernel builds with set_fs()
Add a CONFIG_SET_FS option that is selected by architecturess that
implement set_fs, which is all of them initially. If the option is not
set stubs for routines related to overriding the address space are
provided so that architectures can start to opt out of providing set_fs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Thu, 3 Sep 2020 14:22:34 +0000 (16:22 +0200)]
fs: don't allow splice read/write without explicit ops
default_file_splice_write is the last piece of generic code that uses
set_fs to make the uaccess routines operate on kernel pointers. It
implements a "fallback loop" for splicing from files that do not actually
provide a proper splice_read method. The usual file systems and other
high bandwidth instances all provide a ->splice_read, so this just removes
support for various device drivers and procfs/debugfs files. If splice
support for any of those turns out to be important it can be added back
by switching them to the iter ops and using generic_file_splice_read.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Thu, 3 Sep 2020 14:22:33 +0000 (16:22 +0200)]
fs: don't allow kernel reads and writes without iter ops
Don't allow calling ->read or ->write with set_fs as a preparation for
killing off set_fs. All the instances that we use kernel_read/write on
are using the iter ops already.
If a file has both the regular ->read/->write methods and the iter
variants those could have different semantics for messed up enough
drivers. Also fails the kernel access to them in that case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Matthew Wilcox (Oracle) [Thu, 3 Sep 2020 14:22:32 +0000 (16:22 +0200)]
sysctl: Convert to iter interfaces
Using the read_iter/write_iter interfaces allows for in-kernel users
to set sysctls without using set_fs(). Also, the buffer is a string,
so give it the real type of 'char *', not void *.
[AV: Christoph's fixup folded in]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Thu, 3 Sep 2020 14:22:31 +0000 (16:22 +0200)]
proc: add a read_iter method to proc proc_ops
This will allow proc files to implement iter read semantics.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Thu, 3 Sep 2020 14:22:30 +0000 (16:22 +0200)]
proc: cleanup the compat vs no compat file ops
Instead of providing a special no-compat version provide a special
compat version for operations with ->compat_ioctl.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Thu, 3 Sep 2020 14:22:29 +0000 (16:22 +0200)]
proc: remove a level of indentation in proc_get_inode
Just return early on inode allocation failure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Linus Torvalds [Sun, 23 Aug 2020 21:08:43 +0000 (14:08 -0700)]
Linux 5.9-rc2
Linus Torvalds [Sun, 23 Aug 2020 18:37:23 +0000 (11:37 -0700)]
Merge tag 'powerpc-5.9-3' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Add perf support for emitting extended registers for power10.
- A fix for CPU hotplug on pseries, where on large/loaded systems we
may not wait long enough for the CPU to be offlined, leading to
crashes.
- Addition of a raw cputable entry for Power10, which is not required
to boot, but is required to make our PMU setup work correctly in
guests.
- Three fixes for the recent changes on 32-bit Book3S to move modules
into their own segment for strict RWX.
- A fix for a recent change in our powernv PCI code that could lead to
crashes.
- A change to our perf interrupt accounting to avoid soft lockups when
using some events, found by syzkaller.
- A change in the way we handle power loss events from the hypervisor
on pseries. We no longer immediately shut down if we're told we're
running on a UPS.
- A few other minor fixes.
Thanks to Alexey Kardashevskiy, Andreas Schwab, Aneesh Kumar K.V, Anju T
Sudhakar, Athira Rajeev, Christophe Leroy, Frederic Barrat, Greg Kurz,
Kajol Jain, Madhavan Srinivasan, Michael Neuling, Michael Roth,
Nageswara R Sastry, Oliver O'Halloran, Thiago Jung Bauermann,
Vaidyanathan Srinivasan, Vasant Hegde.
* tag 'powerpc-5.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/perf/hv-24x7: Move cpumask file to top folder of hv-24x7 driver
powerpc/32s: Fix module loading failure when VMALLOC_END is over 0xf0000000
powerpc/pseries: Do not initiate shutdown when system is running on UPS
powerpc/perf: Fix soft lockups due to missed interrupt accounting
powerpc/powernv/pci: Fix possible crash when releasing DMA resources
powerpc/pseries/hotplug-cpu: wait indefinitely for vCPU death
powerpc/32s: Fix is_module_segment() when MODULES_VADDR is defined
powerpc/kasan: Fix KASAN_SHADOW_START on BOOK3S_32
powerpc/fixmap: Fix the size of the early debug area
powerpc/pkeys: Fix build error with PPC_MEM_KEYS disabled
powerpc/kernel: Cleanup machine check function declarations
powerpc: Add POWER10 raw mode cputable entry
powerpc/perf: Add extended regs support for power10 platform
powerpc/perf: Add support for outputting extended regs in perf intr_regs
powerpc: Fix P10 PVR revision in /proc/cpuinfo for SMT4 cores
Linus Torvalds [Sun, 23 Aug 2020 18:21:16 +0000 (11:21 -0700)]
Merge tag 'x86-urgent-2020-08-23' of git://git./linux/kernel/git/tip/tip
Pull x86 fix from Thomas Gleixner:
"A single fix for x86 which removes the RDPID usage from the paranoid
entry path and unconditionally uses LSL to retrieve the CPU number.
RDPID depends on MSR_TSX_AUX. KVM has an optmization to avoid
expensive MRS read/writes on VMENTER/EXIT. It caches the MSR values
and restores them either when leaving the run loop, on preemption or
when going out to user space. MSR_TSX_AUX is part of that lazy MSR
set, so after writing the guest value and before the lazy restore any
exception using the paranoid entry will read the guest value and use
it as CPU number to retrieve the GSBASE value for the current CPU when
FSGSBASE is enabled. As RDPID is only used in that particular entry
path, there is no reason to burden VMENTER/EXIT with two extra MSR
writes. Remove the RDPID optimization, which is not even backed by
numbers from the paranoid entry path instead"
* tag 'x86-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVM
Linus Torvalds [Sun, 23 Aug 2020 18:15:14 +0000 (11:15 -0700)]
Merge tag 'perf-urgent-2020-08-23' of git://git./linux/kernel/git/tip/tip
Pull x86 perf fix from Thomas Gleixner:
"A single update for perf on x86 which has support for the broken down
bandwith counters"
* tag 'perf-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Add BW counters for GT, IA and IO breakdown
Linus Torvalds [Sun, 23 Aug 2020 18:08:32 +0000 (11:08 -0700)]
Merge tag 'efi-urgent-2020-08-23' of git://git./linux/kernel/git/tip/tip
Pull EFI fixes from Thomas Gleixner:
- Enforce NX on RO data in mixed EFI mode
- Destroy workqueue in an error handling path to prevent UAF
- Stop argument parser at '--' which is the delimiter for init
- Treat a NULL command line pointer as empty instead of dereferncing it
unconditionally.
- Handle an unterminated command line correctly
- Cleanup the 32bit code leftovers and remove obsolete documentation
* tag 'efi-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation: efi: remove description of efi=old_map
efi/x86: Move 32-bit code into efi_32.c
efi/libstub: Handle unterminated cmdline
efi/libstub: Handle NULL cmdline
efi/libstub: Stop parsing arguments at "--"
efi: add missed destroy_workqueue when efisubsys_init fails
efi/x86: Mark kernel rodata non-executable for mixed mode
Linus Torvalds [Sun, 23 Aug 2020 18:05:47 +0000 (11:05 -0700)]
Merge tag 'core-urgent-2020-08-23' of git://git./linux/kernel/git/tip/tip
Pull entry fix from Thomas Gleixner:
"A single bug fix for the common entry code.
The transcription of the x86 version messed up the reload of the
syscall number from pt_regs after ptrace and seccomp which breaks
syscall number rewriting"
* tag 'core-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
core/entry: Respect syscall number rewrites
Linus Torvalds [Sun, 23 Aug 2020 17:57:19 +0000 (10:57 -0700)]
Merge tag 'edac_urgent_for_v5.9_rc2' of git://git./linux/kernel/git/ras/ras
Pull EDAC fix from Borislav Petkov:
"A single fix correcting a reversed error severity determination check
which lead to a recoverable error getting marked as fatal, by Tony
Luck"
* tag 'edac_urgent_for_v5.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/{i7core,sb,pnd2,skx}: Fix error event severity
Linus Torvalds [Sun, 23 Aug 2020 17:52:33 +0000 (10:52 -0700)]
Merge git://git./linux/kernel/git/netdev/net
Pull networking fixes from David Miller:
"Nothing earth shattering here, lots of small fixes (f.e. missing RCU
protection, bad ref counting, missing memset(), etc.) all over the
place:
1) Use get_file_rcu() in task_file iterator, from Yonghong Song.
2) There are two ways to set remote source MAC addresses in macvlan
driver, but only one of which validates things properly. Fix this.
From Alvin Å ipraga.
3) Missing of_node_put() in gianfar probing, from Sumera
Priyadarsini.
4) Preserve device wanted feature bits across multiple netlink
ethtool requests, from Maxim Mikityanskiy.
5) Fix rcu_sched stall in task and task_file bpf iterators, from
Yonghong Song.
6) Avoid reset after device destroy in ena driver, from Shay
Agroskin.
7) Missing memset() in netlink policy export reallocation path, from
Johannes Berg.
8) Fix info leak in __smc_diag_dump(), from Peilin Ye.
9) Decapsulate ECN properly for ipv6 in ipv4 tunnels, from Mark
Tomlinson.
10) Fix number of data stream negotiation in SCTP, from David Laight.
11) Fix double free in connection tracker action module, from Alaa
Hleihel.
12) Don't allow empty NHA_GROUP attributes, from Nikolay Aleksandrov"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (46 commits)
net: nexthop: don't allow empty NHA_GROUP
bpf: Fix two typos in uapi/linux/bpf.h
net: dsa: b53: check for timeout
tipc: call rcu_read_lock() in tipc_aead_encrypt_done()
net/sched: act_ct: Fix skb double-free in tcf_ct_handle_fragments() error flow
net: sctp: Fix negotiation of the number of data streams.
dt-bindings: net: renesas, ether: Improve schema validation
gre6: Fix reception with IP6_TNL_F_RCV_DSCP_COPY
hv_netvsc: Fix the queue_mapping in netvsc_vf_xmit()
hv_netvsc: Remove "unlikely" from netvsc_select_queue
bpf: selftests: global_funcs: Check err_str before strstr
bpf: xdp: Fix XDP mode when no mode flags specified
selftests/bpf: Remove test_align leftovers
tools/resolve_btfids: Fix sections with wrong alignment
net/smc: Prevent kernel-infoleak in __smc_diag_dump()
sfc: fix build warnings on 32-bit
net: phy: mscc: Fix a couple of spelling mistakes "spcified" -> "specified"
libbpf: Fix map index used in error message
net: gemini: Fix missing free_netdev() in error path of gemini_ethernet_port_probe()
net: atlantic: Use readx_poll_timeout() for large timeout
...
Linus Torvalds [Sun, 23 Aug 2020 00:11:38 +0000 (17:11 -0700)]
Merge branch 'work.epoll' of git://git./linux/kernel/git/viro/vfs
Pull epoll fixes from Al Viro:
"Fix reference counting and clean up exit paths"
* 'work.epoll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
do_epoll_ctl(): clean the failure exits up a bit
epoll: Keep a reference on files added to the check list
Al Viro [Sat, 22 Aug 2020 22:25:52 +0000 (18:25 -0400)]
do_epoll_ctl(): clean the failure exits up a bit
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Marc Zyngier [Wed, 19 Aug 2020 16:12:17 +0000 (17:12 +0100)]
epoll: Keep a reference on files added to the check list
When adding a new fd to an epoll, and that this new fd is an
epoll fd itself, we recursively scan the fds attached to it
to detect cycles, and add non-epool files to a "check list"
that gets subsequently parsed.
However, this check list isn't completely safe when deletions
can happen concurrently. To sidestep the issue, make sure that
a struct file placed on the check list sees its f_count increased,
ensuring that a concurrent deletion won't result in the file
disapearing from under our feet.
Cc: stable@vger.kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Nikolay Aleksandrov [Sat, 22 Aug 2020 12:06:36 +0000 (15:06 +0300)]
net: nexthop: don't allow empty NHA_GROUP
Currently the nexthop code will use an empty NHA_GROUP attribute, but it
requires at least 1 entry in order to function properly. Otherwise we
end up derefencing null or random pointers all over the place due to not
having any nh_grp_entry members allocated, nexthop code relies on having at
least the first member present. Empty NHA_GROUP doesn't make any sense so
just disallow it.
Also add a WARN_ON for any future users of nexthop_create_group().
BUG: kernel NULL pointer dereference, address:
0000000000000080
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP
CPU: 0 PID: 558 Comm: ip Not tainted 5.9.0-rc1+ #93
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
RIP: 0010:fib_check_nexthop+0x4a/0xaa
Code: 0f 84 83 00 00 00 48 c7 02 80 03 f7 81 c3 40 80 fe fe 75 12 b8 ea ff ff ff 48 85 d2 74 6b 48 c7 02 40 03 f7 81 c3 48 8b 40 10 <48> 8b 80 80 00 00 00 eb 36 80 78 1a 00 74 12 b8 ea ff ff ff 48 85
RSP: 0018:
ffff88807983ba00 EFLAGS:
00010213
RAX:
0000000000000000 RBX:
ffff88807983bc00 RCX:
0000000000000000
RDX:
ffff88807983bc00 RSI:
0000000000000000 RDI:
ffff88807bdd0a80
RBP:
ffff88807983baf8 R08:
0000000000000dc0 R09:
000000000000040a
R10:
0000000000000000 R11:
ffff88807bdd0ae8 R12:
0000000000000000
R13:
0000000000000000 R14:
ffff88807bea3100 R15:
0000000000000001
FS:
00007f10db393700(0000) GS:
ffff88807dc00000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000000000000080 CR3:
000000007bd0f004 CR4:
00000000003706f0
Call Trace:
fib_create_info+0x64d/0xaf7
fib_table_insert+0xf6/0x581
? __vma_adjust+0x3b6/0x4d4
inet_rtm_newroute+0x56/0x70
rtnetlink_rcv_msg+0x1e3/0x20d
? rtnl_calcit.isra.0+0xb8/0xb8
netlink_rcv_skb+0x5b/0xac
netlink_unicast+0xfa/0x17b
netlink_sendmsg+0x334/0x353
sock_sendmsg_nosec+0xf/0x3f
____sys_sendmsg+0x1a0/0x1fc
? copy_msghdr_from_user+0x4c/0x61
___sys_sendmsg+0x63/0x84
? handle_mm_fault+0xa39/0x11b5
? sockfd_lookup_light+0x72/0x9a
__sys_sendmsg+0x50/0x6e
do_syscall_64+0x54/0xbe
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f10dacc0bb7
Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb cd 66 0f 1f 44 00 00 8b 05 9a 4b 2b 00 85 c0 75 2e 48 63 ff 48 63 d2 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 b1 f2 2a 00 f7 d8 64 89 02 48
RSP: 002b:
00007ffcbe628bf8 EFLAGS:
00000246 ORIG_RAX:
000000000000002e
RAX:
ffffffffffffffda RBX:
00007ffcbe628f80 RCX:
00007f10dacc0bb7
RDX:
0000000000000000 RSI:
00007ffcbe628c60 RDI:
0000000000000003
RBP:
000000005f41099c R08:
0000000000000001 R09:
0000000000000008
R10:
00000000000005e9 R11:
0000000000000246 R12:
0000000000000000
R13:
0000000000000000 R14:
00007ffcbe628d70 R15:
0000563a86c6e440
Modules linked in:
CR2:
0000000000000080
CC: David Ahern <dsahern@gmail.com>
Fixes:
430a049190de ("nexthop: Add support for nexthop groups")
Reported-by: syzbot+a61aa19b0c14c8770bd9@syzkaller.appspotmail.com
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sat, 22 Aug 2020 17:22:44 +0000 (10:22 -0700)]
Merge tag 'kbuild-fixes-v5.9' of git://git./linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- move -Wsign-compare warning from W=2 to W=3
- fix the keyword _restrict to __restrict in genksyms
- fix more bugs in qconf
* tag 'kbuild-fixes-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kconfig: qconf: replace deprecated QString::sprintf() with QTextStream
kconfig: qconf: remove redundant help in the info view
kconfig: qconf: remove qInfo() to get back Qt4 support
kconfig: qconf: remove unused colNr
kconfig: qconf: fix the popup menu in the ConfigInfoView window
kconfig: qconf: fix signal connection to invalid slots
genksyms: keywords: Use __restrict not _restrict
kbuild: remove redundant patterns in filter/filter-out
extract-cert: add static to local data
Makefile.extrawarn: Move sign-compare from W=2 to W=3
Linus Torvalds [Sat, 22 Aug 2020 17:17:36 +0000 (10:17 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Allow booting of late secondary CPUs affected by erratum 1418040
(currently they are parked if none of the early CPUs are affected by
this erratum).
- Add the 32-bit vdso Makefile to the vdso_install rule so that 'make
vdso_install' installs the 32-bit compat vdso when it is compiled.
- Print a warning that untrusted guests without a CPU erratum
workaround (Cortex-A57 832075) may deadlock the affected system.
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
ARM64: vdso32: Install vdso32 from vdso_install
KVM: arm64: Print warning when cpu erratum can cause guests to deadlock
arm64: Allow booting of late CPUs affected by erratum 1418040
arm64: Move handling of erratum 1418040 into C code
Linus Torvalds [Sat, 22 Aug 2020 17:12:49 +0000 (10:12 -0700)]
Merge tag 's390-5.9-3' of git://git./linux/kernel/git/s390/linux
Pull s390 fixes from Vasily Gorbik:
- a couple of fixes for storage key handling relevant for debugging
- add cond_resched into potentially slow subchannels scanning loop
- fixes for PF/VF linking and to ignore stale PCI configuration request
events
* tag 's390-5.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/pci: fix PF/VF linking on hot plug
s390/pci: re-introduce zpci_remove_device()
s390/pci: fix zpci_bus_link_virtfn()
s390/ptrace: fix storage key handling
s390/runtime_instrumentation: fix storage key handling
s390/pci: ignore stale configuration request event
s390/cio: add cond_resched() in the slow_eval_known_fn() loop
Linus Torvalds [Sat, 22 Aug 2020 17:03:05 +0000 (10:03 -0700)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
- PAE and PKU bugfixes for x86
- selftests fix for new binutils
- MMU notifier fix for arm64
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: arm64: Only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is not set
KVM: Pass MMU notifier range flags to kvm_unmap_hva_range()
kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode
kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode
KVM: x86: fix access code passed to gva_to_gpa
selftests: kvm: Use a shorter encoding to clear RAX
Linus Torvalds [Sat, 22 Aug 2020 16:56:42 +0000 (09:56 -0700)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"23 fixes in 5 drivers (qla2xxx, ufs, scsi_debug, fcoe, zfcp). The bulk
of the changes are in qla2xxx and ufs and all are mostly small and
definitely don't impact the core"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (23 commits)
Revert "scsi: qla2xxx: Disable T10-DIF feature with FC-NVMe during probe"
Revert "scsi: qla2xxx: Fix crash on qla2x00_mailbox_command"
scsi: qla2xxx: Fix null pointer access during disconnect from subsystem
scsi: qla2xxx: Check if FW supports MQ before enabling
scsi: qla2xxx: Fix WARN_ON in qla_nvme_register_hba
scsi: qla2xxx: Allow ql2xextended_error_logging special value 1 to be set anytime
scsi: qla2xxx: Reduce noisy debug message
scsi: qla2xxx: Fix login timeout
scsi: qla2xxx: Indicate correct supported speeds for Mezz card
scsi: qla2xxx: Flush I/O on zone disable
scsi: qla2xxx: Flush all sessions on zone disable
scsi: qla2xxx: Use MBX_TOV_SECONDS for mailbox command timeout values
scsi: scsi_debug: Fix scp is NULL errors
scsi: zfcp: Fix use-after-free in request timeout handlers
scsi: ufs: No need to send Abort Task if the task in DB was cleared
scsi: ufs: Clean up completed request without interrupt notification
scsi: ufs: Improve interrupt handling for shared interrupts
scsi: ufs: Fix interrupt error message for shared interrupts
scsi: ufs-pci: Add quirk for broken auto-hibernate for Intel EHL
scsi: ufs-mediatek: Fix incorrect time to wait link status
...
Linus Torvalds [Sat, 22 Aug 2020 16:31:11 +0000 (09:31 -0700)]
Merge tag 'devicetree-fixes-for-5.9-2' of git://git./linux/kernel/git/robh/linux
Pull devicetree fixes from Rob Herring:
"Another set of DT fixes:
- restore range parsing error check
- workaround PCI range parsing with missing 'device_type' now
required
- correct description of 'phy-connection-type'
- fix erroneous matching on 'snps,dw-pcie' by 'intel,lgm-pcie' schema
- a couple of grammar and whitespace fixes
- update Shawn Guo's email"
* tag 'devicetree-fixes-for-5.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
dt-bindings: vendor-prefixes: Remove trailing whitespace
dt-bindings: net: correct description of phy-connection-type
dt-bindings: PCI: intel,lgm-pcie: Fix matching on all snps,dw-pcie instances
of: address: Work around missing device_type property in pcie nodes
dt: writing-schema: Miscellaneous grammar fixes
dt-bindings: Use Shawn Guo's preferred e-mail for i.MX bindings
of/address: check for invalid range.cpu_addr
Geert Uytterhoeven [Wed, 19 Aug 2020 09:20:58 +0000 (11:20 +0200)]
dt-bindings: vendor-prefixes: Remove trailing whitespace
Fixes:
f516fb704d02fff2 ("dt-bindings: Whitespace clean-ups in schema files")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/20200819092058.1526-1-geert+renesas@glider.be
Signed-off-by: Rob Herring <robh@kernel.org>
Will Deacon [Tue, 11 Aug 2020 10:27:25 +0000 (11:27 +0100)]
KVM: arm64: Only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is not set
When an MMU notifier call results in unmapping a range that spans multiple
PGDs, we end up calling into cond_resched_lock() when crossing a PGD boundary,
since this avoids running into RCU stalls during VM teardown. Unfortunately,
if the VM is destroyed as a result of OOM, then blocking is not permitted
and the call to the scheduler triggers the following BUG():
| BUG: sleeping function called from invalid context at arch/arm64/kvm/mmu.c:394
| in_atomic(): 1, irqs_disabled(): 0, non_block: 1, pid: 36, name: oom_reaper
| INFO: lockdep is turned off.
| CPU: 3 PID: 36 Comm: oom_reaper Not tainted 5.8.0 #1
| Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
| Call trace:
| dump_backtrace+0x0/0x284
| show_stack+0x1c/0x28
| dump_stack+0xf0/0x1a4
| ___might_sleep+0x2bc/0x2cc
| unmap_stage2_range+0x160/0x1ac
| kvm_unmap_hva_range+0x1a0/0x1c8
| kvm_mmu_notifier_invalidate_range_start+0x8c/0xf8
| __mmu_notifier_invalidate_range_start+0x218/0x31c
| mmu_notifier_invalidate_range_start_nonblock+0x78/0xb0
| __oom_reap_task_mm+0x128/0x268
| oom_reap_task+0xac/0x298
| oom_reaper+0x178/0x17c
| kthread+0x1e4/0x1fc
| ret_from_fork+0x10/0x30
Use the new 'flags' argument to kvm_unmap_hva_range() to ensure that we
only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is set in the notifier
flags.
Cc: <stable@vger.kernel.org>
Fixes:
8b3405e345b5 ("kvm: arm/arm64: Fix locking for kvm_free_stage2_pgd")
Cc: Marc Zyngier <maz@kernel.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <
20200811102725.7121-3-will@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Will Deacon [Tue, 11 Aug 2020 10:27:24 +0000 (11:27 +0100)]
KVM: Pass MMU notifier range flags to kvm_unmap_hva_range()
The 'flags' field of 'struct mmu_notifier_range' is used to indicate
whether invalidate_range_{start,end}() are permitted to block. In the
case of kvm_mmu_notifier_invalidate_range_start(), this field is not
forwarded on to the architecture-specific implementation of
kvm_unmap_hva_range() and therefore the backend cannot sensibly decide
whether or not to block.
Add an extra 'flags' parameter to kvm_unmap_hva_range() so that
architectures are aware as to whether or not they are permitted to block.
Cc: <stable@vger.kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <
20200811102725.7121-2-will@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Madalin Bucur [Thu, 20 Aug 2020 10:02:04 +0000 (13:02 +0300)]
dt-bindings: net: correct description of phy-connection-type
The phy-connection-type parameter is described in ePAPR 1.1:
Specifies interface type between the Ethernet device and a physical
layer (PHY) device. The value of this property is specific to the
implementation.
Signed-off-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
Link: https://lore.kernel.org/r/1597917724-11127-1-git-send-email-madalin.bucur@oss.nxp.com
Signed-off-by: Rob Herring <robh@kernel.org>
Linus Torvalds [Fri, 21 Aug 2020 21:59:16 +0000 (14:59 -0700)]
Merge tag 'io_uring-5.9-2020-08-21' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- Make sure the head link cancelation includes async work
- Get rid of kiocb_wait_page_queue_init(), makes no sense to have it as
a separate function since you moved it into io_uring itself
- io_import_iovec cleanups (Pavel, me)
- Use system_unbound_wq for ring exit work, to avoid spawning tons of
these if we have tons of rings exiting at the same time
- Fix req->flags overflow flag manipulation (Pavel)
* tag 'io_uring-5.9-2020-08-21' of git://git.kernel.dk/linux-block:
io_uring: kill extra iovec=NULL in import_iovec()
io_uring: comment on kfree(iovec) checks
io_uring: fix racy req->flags modification
io_uring: use system_unbound_wq for ring exit work
io_uring: cleanup io_import_iovec() of pre-mapped request
io_uring: get rid of kiocb_wait_page_queue_init()
io_uring: find and cancel head link async work on files exit
Rob Herring [Wed, 19 Aug 2020 17:58:16 +0000 (11:58 -0600)]
dt-bindings: PCI: intel,lgm-pcie: Fix matching on all snps,dw-pcie instances
The intel,lgm-pcie binding is matching on all snps,dw-pcie instances
which is wrong. Add a custom 'select' entry to fix this.
Fixes:
e54ea45a4955 ("dt-bindings: PCI: intel: Add YAML schemas for the PCIe RC controller")
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
Reviewed-by: Dilip Kota <eswara.kota@linux.intel.com>
Signed-off-by: Rob Herring <robh@kernel.org>
Linus Torvalds [Fri, 21 Aug 2020 21:44:48 +0000 (14:44 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"11 patches.
Subsystems affected by this: misc, mm/hugetlb, mm/vmalloc, mm/misc,
romfs, relay, uprobes, squashfs, mm/cma, mm/pagealloc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm, page_alloc: fix core hung in free_pcppages_bulk()
mm: include CMA pages in lowmem_reserve at boot
squashfs: avoid bio_alloc() failure with 1Mbyte blocks
uprobes: __replace_page() avoid BUG in munlock_vma_page()
kernel/relay.c: fix memleak on destroy relay channel
romfs: fix uninitialized memory leak in romfs_dev_read()
mm/rodata_test.c: fix missing function declaration
mm/vunmap: add cond_resched() in vunmap_pmd_range
khugepaged: adjust VM_BUG_ON_MM() in __khugepaged_enter()
hugetlb_cgroup: convert comma to semicolon
mailmap: add Andi Kleen
David S. Miller [Fri, 21 Aug 2020 19:54:50 +0000 (12:54 -0700)]
Merge git://git./pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:
====================
pull-request: bpf 2020-08-21
The following pull-request contains BPF updates for your *net* tree.
We've added 11 non-merge commits during the last 5 day(s) which contain
a total of 12 files changed, 78 insertions(+), 24 deletions(-).
The main changes are:
1) three fixes in BPF task iterator logic, from Yonghong.
2) fix for compressed dwarf sections in vmlinux, from Jiri.
3) fix xdp attach regression, from Andrii.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 21 Aug 2020 19:32:42 +0000 (12:32 -0700)]
Merge tag 'riscv-for-linus-5.9-rc2' of git://git./linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- The CLINT driver has been split in two: one to handle the M-mode
CLINT (memory mapped and used on NOMMU systems) and one to handle the
S-mode CLINT (via SBI).
- The addition of SiFive's drivers to rv32_defconfig
* tag 'riscv-for-linus-5.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Add SiFive drivers to rv32_defconfig
dt-bindings: timer: Add CLINT bindings
RISC-V: Remove CLINT related code from timer and arch
clocksource/drivers: Add CLINT timer driver
RISC-V: Add mechanism to provide custom IPI operations
Linus Torvalds [Fri, 21 Aug 2020 19:28:33 +0000 (12:28 -0700)]
Merge tag 'for-linus-5.9-rc2-tag' of git://git./linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"One build fix and a minor fix for suppressing a useless warning when
booting a Xen dom0 via UEFI"
* tag 'for-linus-5.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
Fix build error when CONFIG_ACPI is not set/enabled:
efi: avoid error message when booting under Xen
Linus Torvalds [Fri, 21 Aug 2020 19:26:58 +0000 (12:26 -0700)]
Merge tag 'pm-5.9-rc2' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix a few issues in the operating performance points (OPP)
framework.
Specifics:
- Fix re-enabling of resources in dev_pm_opp_set_rate() (Rajendra
Nayak)
- Fix OPP table reference counting in error paths (Stephen Boyd)"
* tag 'pm-5.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
opp: Enable resources again if they were disabled earlier
opp: Put opp table in dev_pm_opp_set_rate() if _set_opp_bw() fails
opp: Put opp table in dev_pm_opp_set_rate() for empty tables
Tobias Klauser [Fri, 21 Aug 2020 13:36:42 +0000 (15:36 +0200)]
bpf: Fix two typos in uapi/linux/bpf.h
Also remove trailing whitespaces in bpf_skb_get_tunnel_key example code.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200821133642.18870-1-tklauser@distanz.ch
Tom Rix [Fri, 21 Aug 2020 13:56:00 +0000 (06:56 -0700)]
net: dsa: b53: check for timeout
clang static analysis reports this problem
b53_common.c:1583:13: warning: The left expression of the compound
assignment is an uninitialized value. The computed value will
also be garbage
ent.port &= ~BIT(port);
~~~~~~~~ ^
ent is set by a successful call to b53_arl_read(). Unsuccessful
calls are caught by an switch statement handling specific returns.
b32_arl_read() calls b53_arl_op_wait() which fails with the
unhandled -ETIMEDOUT.
So add -ETIMEDOUT to the switch statement. Because
b53_arl_op_wait() already prints out a message, do not add another
one.
Fixes:
1da6df85c6fb ("net: dsa: b53: Implement ARL add/del/dump operations")
Signed-off-by: Tom Rix <trix@redhat.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stephen Boyd [Tue, 18 Aug 2020 01:49:50 +0000 (18:49 -0700)]
ARM64: vdso32: Install vdso32 from vdso_install
Add the 32-bit vdso Makefile to the vdso_install rule so that 'make
vdso_install' installs the 32-bit compat vdso when it is compiled.
Fixes:
a7f71a2c8903 ("arm64: compat: Add vDSO")
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Link: https://lore.kernel.org/r/20200818014950.42492-1-swboyd@chromium.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Linus Torvalds [Fri, 21 Aug 2020 18:03:38 +0000 (11:03 -0700)]
Merge tag 'ext4_for_linus' of git://git./linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Improvements to ext4's block allocator performance for very large file
systems, especially when the file system or files which are highly
fragmented. There is a new mount option, prefetch_block_bitmaps which
will pull in the block bitmaps and set up the in-memory buddy bitmaps
when the file system is initially mounted.
Beyond that, a lot of bug fixes and cleanups. In particular, a number
of changes to make ext4 more robust in the face of write errors or
file system corruptions"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (46 commits)
ext4: limit the length of per-inode prealloc list
ext4: reorganize if statement of ext4_mb_release_context()
ext4: add mb_debug logging when there are lost chunks
ext4: Fix comment typo "the the".
jbd2: clean up checksum verification in do_one_pass()
ext4: change to use fallthrough macro
ext4: remove unused parameter of ext4_generic_delete_entry function
mballoc: replace seq_printf with seq_puts
ext4: optimize the implementation of ext4_mb_good_group()
ext4: delete invalid comments near ext4_mb_check_limits()
ext4: fix typos in ext4_mb_regular_allocator() comment
ext4: fix checking of directory entry validity for inline directories
fs: prevent BUG_ON in submit_bh_wbc()
ext4: correctly restore system zone info when remount fails
ext4: handle add_system_zone() failure in ext4_setup_system_zone()
ext4: fold ext4_data_block_valid_rcu() into the caller
ext4: check journal inode extents more carefully
ext4: don't allow overlapping system zones
ext4: handle error of ext4_setup_system_zone() on remount
ext4: delete the invalid BUGON in ext4_mb_load_buddy_gfp()
...
David Howells [Fri, 21 Aug 2020 09:15:12 +0000 (10:15 +0100)]
afs: Fix NULL deref in afs_dynroot_depopulate()
If an error occurs during the construction of an afs superblock, it's
possible that an error occurs after a superblock is created, but before
we've created the root dentry. If the superblock has a dynamic root
(ie. what's normally mounted on /afs), the afs_kill_super() will call
afs_dynroot_depopulate() to unpin any created dentries - but this will
oops if the root hasn't been created yet.
Fix this by skipping that bit of code if there is no root dentry.
This leads to an oops looking like:
general protection fault, ...
KASAN: null-ptr-deref in range [0x0000000000000068-0x000000000000006f]
...
RIP: 0010:afs_dynroot_depopulate+0x25f/0x529 fs/afs/dynroot.c:385
...
Call Trace:
afs_kill_super+0x13b/0x180 fs/afs/super.c:535
deactivate_locked_super+0x94/0x160 fs/super.c:335
afs_get_tree+0x1124/0x1460 fs/afs/super.c:598
vfs_get_tree+0x89/0x2f0 fs/super.c:1547
do_new_mount fs/namespace.c:2875 [inline]
path_mount+0x1387/0x2070 fs/namespace.c:3192
do_mount fs/namespace.c:3205 [inline]
__do_sys_mount fs/namespace.c:3413 [inline]
__se_sys_mount fs/namespace.c:3390 [inline]
__x64_sys_mount+0x27f/0x300 fs/namespace.c:3390
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
which is oopsing on this line:
inode_lock(root->d_inode);
presumably because sb->s_root was NULL.
Fixes:
0da0b7fd73e4 ("afs: Display manually added cells in dynamic root mount")
Reported-by: syzbot+c1eff8205244ae7e11a6@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 21 Aug 2020 17:14:16 +0000 (10:14 -0700)]
Merge tag 'for-linus' of git://git./linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"One regression from 5.8 and a few bugs from earlier kernels:
- Various spelling corrections in kernel prints
- Bug fixes in hfi1 and bntx_re
- Revert a 5.8 patch in hns
- Batch update for Mellanox and Cumulus maintainers emails"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
MAINTAINERS: Update Mellanox and Cumulus Network addresses to new domain
Revert "RDMA/hns: Reserve one sge in order to avoid local length error"
RDMA/hfi1: Correct an interlock issue for TID RDMA WRITE request
RDMA/bnxt_re: Do not add user qps to flushlist
RDMA/core: Fix spelling mistake "Could't" -> "Couldn't"
RDMA/usnic: Fix spelling mistake "transistion" -> "transition"
RDMA/hns: Fix spelling mistake "epmty" -> "empty"
Linus Torvalds [Fri, 21 Aug 2020 17:07:54 +0000 (10:07 -0700)]
Merge tag 'sound-5.9-rc2' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of small fixes over several drivers, but all are driver-
specific and nothing looks scary.
Slightly large changes are seen in ASoC qcom driver for the bugs that
were revealed by the recent ASoC core change to report the invalid
register access errors. Also ASoC fsl got a slight intensive change
for the distortion fix.
Others are only trivial fixes or device-specific quirks"
* tag 'sound-5.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (25 commits)
ALSA: hda: avoid reset of sdo_limit
ALSA: hda/realtek: Add quirk for Samsung Galaxy Book Ion
ALSA: usb-audio: ignore broken processing/extension unit
ASoC: intel: Fix memleak in sst_media_open
ASoC: wm8994: Avoid attempts to read unreadable registers
ASoC: msm8916-wcd-analog: fix register Interrupt offset
ASoC: wm8994: Prevent access to invalid VU register bits on WM1811
ALSA: hda/realtek: Add model alc298-samsung-headphone
ALSA: usb-audio: Update documentation comment for MS2109 quirk
ALSA: isa: fix spelling mistakes in the comments
ALSA: usb-audio: Add capture support for Saffire 6 (USB 1.1)
ALSA: hda/realtek: Add quirk for Samsung Galaxy Flex Book
ASoC: q6routing: add dummy register read/write function
ASoC: q6afe-dai: mark all widgets registers as SND_SOC_NOPM
ASoC: Make soc_component_read() returning an error code again
ASoC: amd: Replacing component->name with codec_dai->name.
ASoC: fsl: Fix unused variable warning
ASoC: tegra: tegra210_i2s: Fix compile warning with CONFIG_PM=n
ASoC: tegra: tegra210_dmic: Fix compile warning with CONFIG_PM=n
ASoC: tegra: tegra210_ahub: Fix compile warning with CONFIG_PM=n
...
Linus Torvalds [Fri, 21 Aug 2020 17:02:44 +0000 (10:02 -0700)]
Merge tag 'drm-fixes-2020-08-21' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Regular fixes pull for rc2. Usual rc2 doesn't seem too busy, mainly
i915 and amdgpu. I'd expect the usual uptick for rc3.
amdgpu:
- Fix allocation size
- SR-IOV fixes
- Vega20 SMU feature state caching fix
- Fix custom pptable handling
- Arcturus golden settings update
- Several display fixes
- Fixes for Navy Flounder
- Misc display fixes
- RAS fix
amdkfd:
- SDMA fix for renoir
i915:
- Fix device parameter usage for selftest mock i915 device
- Fix LPSP capability debugfs NULL dereference
- Fix buddy register pagemask table
- Fix intel_atomic_check() non-negative return value
- Fix selftests passing a random 0 into ilog2()
- Fix TGL power well enable/disable ordering
- Switch to PMU module refcounting
- GVT fixes
virtio:
- Add missing dma_fence_put() in virtio_gpu_execbuffer_ioctl()
- Fix memory leak in virtio_gpu_cleanup_object()"
* tag 'drm-fixes-2020-08-21' of git://anongit.freedesktop.org/drm/drm: (34 commits)
Revert "drm/amdgpu: disable gfxoff for navy_flounder"
drm/i915/tgl: Make sure TC-cold is blocked before enabling TC AUX power wells
drm/i915/selftests: Avoid passing a random 0 into ilog2
drm/i915: Fix wrong return value in intel_atomic_check()
drm/i915: Update bw_buddy pagemask table
drm/i915/display: Check for an LPSP encoder before dereferencing
drm/i915: Copy default modparams to mock i915_device
drm/i915: Provide the perf pmu.module
drm/amd/display: fix pow() crashing when given base 0
drm/amd/display: Reset scrambling on Test Pattern
drm/amd/display: fix dcn3 wide timing dsc validation
drm/amd/display: Fix DFPstate hang due to view port changed
drm/amd/display: Assign correct left shift
drm/amd/display: Call DMUB for eDP power control
drm/amdkfd: fix the wrong sdma instance query for renoir
drm/amdgpu: parse ta firmware for navy_flounder
drm/amdgpu: fix NULL pointer access issue when unloading driver
drm/amdgpu: fix uninit-value in arcturus_log_thermal_throttling_event()
drm/amdgpu: disable gfxoff for navy_flounder
drm/amdgpu/display: use GFP_ATOMIC in dcn20_validate_bandwidth_internal
...
Charan Teja Reddy [Fri, 21 Aug 2020 00:42:27 +0000 (17:42 -0700)]
mm, page_alloc: fix core hung in free_pcppages_bulk()
The following race is observed with the repeated online, offline and a
delay between two successive online of memory blocks of movable zone.
P1 P2
Online the first memory block in
the movable zone. The pcp struct
values are initialized to default
values,i.e., pcp->high = 0 &
pcp->batch = 1.
Allocate the pages from the
movable zone.
Try to Online the second memory
block in the movable zone thus it
entered the online_pages() but yet
to call zone_pcp_update().
This process is entered into
the exit path thus it tries
to release the order-0 pages
to pcp lists through
free_unref_page_commit().
As pcp->high = 0, pcp->count = 1
proceed to call the function
free_pcppages_bulk().
Update the pcp values thus the
new pcp values are like, say,
pcp->high = 378, pcp->batch = 63.
Read the pcp's batch value using
READ_ONCE() and pass the same to
free_pcppages_bulk(), pcp values
passed here are, batch = 63,
count = 1.
Since num of pages in the pcp
lists are less than ->batch,
then it will stuck in
while(list_empty(list)) loop
with interrupts disabled thus
a core hung.
Avoid this by ensuring free_pcppages_bulk() is called with proper count of
pcp list pages.
The mentioned race is some what easily reproducible without [1] because
pcp's are not updated for the first memory block online and thus there is
a enough race window for P2 between alloc+free and pcp struct values
update through onlining of second memory block.
With [1], the race still exists but it is very narrow as we update the pcp
struct values for the first memory block online itself.
This is not limited to the movable zone, it could also happen in cases
with the normal zone (e.g., hotplug to a node that only has DMA memory, or
no other memory yet).
[1]: https://patchwork.kernel.org/patch/
11696389/
Fixes:
5f8dcc21211a ("page-allocator: split per-cpu list into one-list-per-migrate-type")
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: <stable@vger.kernel.org> [2.6+]
Link: http://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Doug Berger [Fri, 21 Aug 2020 00:42:24 +0000 (17:42 -0700)]
mm: include CMA pages in lowmem_reserve at boot
The lowmem_reserve arrays provide a means of applying pressure against
allocations from lower zones that were targeted at higher zones. Its
values are a function of the number of pages managed by higher zones and
are assigned by a call to the setup_per_zone_lowmem_reserve() function.
The function is initially called at boot time by the function
init_per_zone_wmark_min() and may be called later by accesses of the
/proc/sys/vm/lowmem_reserve_ratio sysctl file.
The function init_per_zone_wmark_min() was moved up from a module_init to
a core_initcall to resolve a sequencing issue with khugepaged.
Unfortunately this created a sequencing issue with CMA page accounting.
The CMA pages are added to the managed page count of a zone when
cma_init_reserved_areas() is called at boot also as a core_initcall. This
makes it uncertain whether the CMA pages will be added to the managed page
counts of their zones before or after the call to
init_per_zone_wmark_min() as it becomes dependent on link order. With the
current link order the pages are added to the managed count after the
lowmem_reserve arrays are initialized at boot.
This means the lowmem_reserve values at boot may be lower than the values
used later if /proc/sys/vm/lowmem_reserve_ratio is accessed even if the
ratio values are unchanged.
In many cases the difference is not significant, but for example
an ARM platform with 1GB of memory and the following memory layout
cma: Reserved 256 MiB at 0x0000000030000000
Zone ranges:
DMA [mem 0x0000000000000000-0x000000002fffffff]
Normal empty
HighMem [mem 0x0000000030000000-0x000000003fffffff]
would result in 0 lowmem_reserve for the DMA zone. This would allow
userspace to deplete the DMA zone easily.
Funnily enough
$ cat /proc/sys/vm/lowmem_reserve_ratio
would fix up the situation because as a side effect it forces
setup_per_zone_lowmem_reserve.
This commit breaks the link order dependency by invoking
init_per_zone_wmark_min() as a postcore_initcall so that the CMA pages
have the chance to be properly accounted in their zone(s) and allowing
the lowmem_reserve arrays to receive consistent values.
Fixes:
bc22af74f271 ("mm: update min_free_kbytes from khugepaged after core initialization")
Signed-off-by: Doug Berger <opendmb@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1597423766-27849-1-git-send-email-opendmb@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Phillip Lougher [Fri, 21 Aug 2020 00:42:21 +0000 (17:42 -0700)]
squashfs: avoid bio_alloc() failure with 1Mbyte blocks
This is a regression introduced by the patch "migrate from ll_rw_block
usage to BIO".
Bio_alloc() is limited to 256 pages (1 Mbyte). This can cause a failure
when reading 1 Mbyte block filesystems. The problem is a datablock can be
fully (or almost uncompressed), requiring 256 pages, but, because blocks
are not aligned to page boundaries, it may require 257 pages to read.
Bio_kmalloc() can handle 1024 pages, and so use this for the edge
condition.
Fixes:
93e72b3c612a ("squashfs: migrate from ll_rw_block usage to BIO")
Reported-by: Nicolas Prochazka <nicolas.prochazka@gmail.com>
Reported-by: Tomoatsu Shimada <shimada@walbrix.com>
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Cc: Philippe Liard <pliard@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Adrien Schildknecht <adrien+dev@schischi.me>
Cc: Daniel Rosenberg <drosen@google.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200815035637.15319-1-phillip@squashfs.org.uk
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Fri, 21 Aug 2020 00:42:17 +0000 (17:42 -0700)]
uprobes: __replace_page() avoid BUG in munlock_vma_page()
syzbot crashed on the VM_BUG_ON_PAGE(PageTail) in munlock_vma_page(), when
called from uprobes __replace_page(). Which of many ways to fix it?
Settled on not calling when PageCompound (since Head and Tail are equals
in this context, PageCompound the usual check in uprobes.c, and the prior
use of FOLL_SPLIT_PMD will have cleared PageMlocked already).
Fixes:
5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org> [5.4+]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008161338360.20413@eggly.anvils
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wei Yongjun [Fri, 21 Aug 2020 00:42:14 +0000 (17:42 -0700)]
kernel/relay.c: fix memleak on destroy relay channel
kmemleak report memory leak as follows:
unreferenced object 0x607ee4e5f948 (size 8):
comm "syz-executor.1", pid 2098, jiffies
4295031601 (age 288.468s)
hex dump (first 8 bytes):
00 00 00 00 00 00 00 00 ........
backtrace:
relay_open kernel/relay.c:583 [inline]
relay_open+0xb6/0x970 kernel/relay.c:563
do_blk_trace_setup+0x4a8/0xb20 kernel/trace/blktrace.c:557
__blk_trace_setup+0xb6/0x150 kernel/trace/blktrace.c:597
blk_trace_ioctl+0x146/0x280 kernel/trace/blktrace.c:738
blkdev_ioctl+0xb2/0x6a0 block/ioctl.c:613
block_ioctl+0xe5/0x120 fs/block_dev.c:1871
vfs_ioctl fs/ioctl.c:48 [inline]
__do_sys_ioctl fs/ioctl.c:753 [inline]
__se_sys_ioctl fs/ioctl.c:739 [inline]
__x64_sys_ioctl+0x170/0x1ce fs/ioctl.c:739
do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
'chan->buf' is malloced in relay_open() by alloc_percpu() but not free
while destroy the relay channel. Fix it by adding free_percpu() before
return from relay_destroy_channel().
Fixes:
017c59c042d0 ("relay: Use per CPU constructs for the relay channel buffer pointers")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: David Rientjes <rientjes@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Akash Goel <akash.goel@intel.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200817122826.48518-1-weiyongjun1@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jann Horn [Fri, 21 Aug 2020 00:42:11 +0000 (17:42 -0700)]
romfs: fix uninitialized memory leak in romfs_dev_read()
romfs has a superblock field that limits the size of the filesystem; data
beyond that limit is never accessed.
romfs_dev_read() fetches a caller-supplied number of bytes from the
backing device. It returns 0 on success or an error code on failure;
therefore, its API can't represent short reads, it's all-or-nothing.
However, when romfs_dev_read() detects that the requested operation would
cross the filesystem size limit, it currently silently truncates the
requested number of bytes. This e.g. means that when the content of a
file with size 0x1000 starts one byte before the filesystem size limit,
->readpage() will only fill a single byte of the supplied page while
leaving the rest uninitialized, leaking that uninitialized memory to
userspace.
Fix it by returning an error code instead of truncating the read when the
requested read operation would go beyond the end of the filesystem.
Fixes:
da4458bda237 ("NOMMU: Make it possible for RomFS to use MTD devices directly")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: David Howells <dhowells@redhat.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200818013202.2246365-1-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Leon Romanovsky [Fri, 21 Aug 2020 00:42:08 +0000 (17:42 -0700)]
mm/rodata_test.c: fix missing function declaration
The compilation with CONFIG_DEBUG_RODATA_TEST set produces the following
warning due to the missing include.
mm/rodata_test.c:15:6: warning: no previous prototype for 'rodata_test' [-Wmissing-prototypes]
15 | void rodata_test(void)
| ^~~~~~~~~~~
Fixes:
2959a5f726f6 ("mm: add arch-independent testcases for RODATA")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lkml.kernel.org/r/20200819080026.918134-1-leon@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Aneesh Kumar K.V [Fri, 21 Aug 2020 00:42:05 +0000 (17:42 -0700)]
mm/vunmap: add cond_resched() in vunmap_pmd_range
Like zap_pte_range add cond_resched so that we can avoid softlockups as
reported below. On non-preemptible kernel with large I/O map region (like
the one we get when using persistent memory with sector mode), an unmap of
the namespace can report below softlockups.
22724.027334] watchdog: BUG: soft lockup - CPU#49 stuck for 23s! [ndctl:50777]
NIP [
c0000000000dc224] plpar_hcall+0x38/0x58
LR [
c0000000000d8898] pSeries_lpar_hpte_invalidate+0x68/0xb0
Call Trace:
flush_hash_page+0x114/0x200
hpte_need_flush+0x2dc/0x540
vunmap_page_range+0x538/0x6f0
free_unmap_vmap_area+0x30/0x70
remove_vm_area+0xfc/0x140
__vunmap+0x68/0x270
__iounmap.part.0+0x34/0x60
memunmap+0x54/0x70
release_nodes+0x28c/0x300
device_release_driver_internal+0x16c/0x280
unbind_store+0x124/0x170
drv_attr_store+0x44/0x60
sysfs_kf_write+0x64/0x90
kernfs_fop_write+0x1b0/0x290
__vfs_write+0x3c/0x70
vfs_write+0xd8/0x260
ksys_write+0xdc/0x130
system_call+0x5c/0x70
Reported-by: Harish Sriram <harish@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200807075933.310240-1-aneesh.kumar@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Fri, 21 Aug 2020 00:42:02 +0000 (17:42 -0700)]
khugepaged: adjust VM_BUG_ON_MM() in __khugepaged_enter()
syzbot crashes on the VM_BUG_ON_MM(khugepaged_test_exit(mm), mm) in
__khugepaged_enter(): yes, when one thread is about to dump core, has set
core_state, and is waiting for others, another might do something calling
__khugepaged_enter(), which now crashes because I lumped the core_state
test (known as "mmget_still_valid") into khugepaged_test_exit(). I still
think it's best to lump them together, so just in this exceptional case,
check mm->mm_users directly instead of khugepaged_test_exit().
Fixes:
bbe98f9cadff ("khugepaged: khugepaged_test_exit() check mmget_still_valid()")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Yang Shi <shy828301@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: <stable@vger.kernel.org> [4.8+]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008141503370.18085@eggly.anvils
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xu Wang [Fri, 21 Aug 2020 00:41:59 +0000 (17:41 -0700)]
hugetlb_cgroup: convert comma to semicolon
Replace a comma between expression statements by a semicolon.
Fixes:
faced7e0806cf4 ("mm: hugetlb controller for cgroups v2")
Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Giuseppe Scrivano <gscrivan@redhat.com>
Link: http://lkml.kernel.org/r/20200818064333.21759-1-vulab@iscas.ac.cn
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Desaulniers [Fri, 21 Aug 2020 00:41:56 +0000 (17:41 -0700)]
mailmap: add Andi Kleen
I keep getting bounce back from the suse.de address.
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Quentin Perret <qperret@qperret.net>
Link: http://lkml.kernel.org/r/20200818203214.659955-1-ndesaulniers@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thomas Gleixner [Wed, 19 Aug 2020 19:44:39 +0000 (21:44 +0200)]
core/entry: Respect syscall number rewrites
The transcript of the x86 entry code to the generic version failed to
reload the syscall number from ptregs after ptrace and seccomp have run,
which both can modify the syscall number in ptregs. It returns the original
syscall number instead which is obviously not the right thing to do.
Reload the syscall number to fix that.
Fixes:
142781e108b1 ("entry: Provide generic syscall entry functionality")
Reported-by: Kyle Huey <me@kylehuey.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kyle Huey <me@kylehuey.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/87blj6ifo8.fsf@nanos.tec.linutronix.de
Sean Christopherson [Fri, 21 Aug 2020 10:52:29 +0000 (06:52 -0400)]
x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVM
KVM has an optmization to avoid expensive MRS read/writes on
VMENTER/EXIT. It caches the MSR values and restores them either when
leaving the run loop, on preemption or when going out to user space.
The affected MSRs are not required for kernel context operations. This
changed with the recently introduced mechanism to handle FSGSBASE in the
paranoid entry code which has to retrieve the kernel GSBASE value by
accessing per CPU memory. The mechanism needs to retrieve the CPU number
and uses either LSL or RDPID if the processor supports it.
Unfortunately RDPID uses MSR_TSC_AUX which is in the list of cached and
lazily restored MSRs, which means between the point where the guest value
is written and the point of restore, MSR_TSC_AUX contains a random number.
If an NMI or any other exception which uses the paranoid entry path happens
in such a context, then RDPID returns the random guest MSR_TSC_AUX value.
As a consequence this reads from the wrong memory location to retrieve the
kernel GSBASE value. Kernel GS is used to for all regular this_cpu_*()
operations. If the GSBASE in the exception handler points to the per CPU
memory of a different CPU then this has the obvious consequences of data
corruption and crashes.
As the paranoid entry path is the only place which accesses MSR_TSX_AUX
(via RDPID) and the fallback via LSL is not significantly slower, remove
the RDPID alternative from the entry path and always use LSL.
The alternative would be to write MSR_TSC_AUX on every VMENTER and VMEXIT
which would be inflicting massive overhead on that code path.
[ tglx: Rewrote changelog ]
Fixes:
eaad981291ee3 ("x86/entry/64: Introduce the FIND_PERCPU_BASE macro")
Reported-by: Tom Lendacky <thomas.lendacky@amd.com>
Debugged-by: Tom Lendacky <thomas.lendacky@amd.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200821105229.18938-1-pbonzini@redhat.com
Leon Romanovsky [Mon, 10 Aug 2020 09:10:59 +0000 (12:10 +0300)]
MAINTAINERS: Update Mellanox and Cumulus Network addresses to new domain
Mellanox and Cumulus Network were acquired by Nvidia, so change the
maintainers emails to new domain name.
Link: https://lore.kernel.org/r/20200810091100.243932-1-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Kajol Jain [Fri, 21 Aug 2020 08:06:10 +0000 (13:36 +0530)]
powerpc/perf/hv-24x7: Move cpumask file to top folder of hv-24x7 driver
Commit
792f73f747b8 ("powerpc/hv-24x7: Add sysfs files inside hv-24x7
device to show cpumask") added cpumask file as part of hv-24x7 driver
inside the interface folder. The cpumask file is supposed to be in the
top folder of the PMU driver in order to make hotplug work.
This patch fixes that issue and creates new group 'cpumask_attr_group'
to add cpumask file and make sure it added in top folder.
command:# cat /sys/devices/hv_24x7/cpumask
0
Fixes:
792f73f747b8 ("powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show cpumask")
Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200821080610.123997-1-kjain@linux.ibm.com
Christophe Leroy [Fri, 21 Aug 2020 07:15:25 +0000 (07:15 +0000)]
powerpc/32s: Fix module loading failure when VMALLOC_END is over 0xf0000000
In is_module_segment(), when VMALLOC_END is over 0xf0000000,
ALIGN(VMALLOC_END, SZ_256M) has value 0.
In that case, addr >= ALIGN(VMALLOC_END, SZ_256M) is always
true then is_module_segment() always returns false.
Use (ALIGN(VMALLOC_END, SZ_256M) - 1) which will have
value 0xffffffff and will be suitable for the comparison.
Fixes:
c49643319715 ("powerpc/32s: Only leave NX unset on segments used for modules")
Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Tested-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/09fc73fe9c7423c6b4cf93f93df9bb0ed8eefab5.1597994047.git.christophe.leroy@csgroup.eu
Rob Herring [Mon, 3 Aug 2020 19:31:25 +0000 (13:31 -0600)]
KVM: arm64: Print warning when cpu erratum can cause guests to deadlock
If guests don't have certain CPU erratum workarounds implemented, then
there is a possibility a guest can deadlock the system. IOW, only trusted
guests should be used on systems with the erratum.
This is the case for Cortex-A57 erratum 832075.
Signed-off-by: Rob Herring <robh@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Julien Thierry <julien.thierry.kdev@gmail.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: kvmarm@lists.cs.columbia.edu
Link: https://lore.kernel.org/r/20200803193127.3012242-2-robh@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Marc Zyngier [Fri, 31 Jul 2020 17:38:24 +0000 (18:38 +0100)]
arm64: Allow booting of late CPUs affected by erratum 1418040
As we can now switch from a system that isn't affected by 1418040
to a system that globally is affected, let's allow affected CPUs
to come in at a later time.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20200731173824.107480-3-maz@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Marc Zyngier [Fri, 31 Jul 2020 17:38:23 +0000 (18:38 +0100)]
arm64: Move handling of erratum 1418040 into C code
Instead of dealing with erratum 1418040 on each entry and exit,
let's move the handling to __switch_to() instead, which has
several advantages:
- It can be applied when it matters (switching between 32 and 64
bit tasks).
- It is written in C (yay!)
- It can rely on static keys rather than alternatives
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20200731173824.107480-2-maz@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>