platform/kernel/linux-rpi.git
3 years agomm/mremap: don't enable optimized PUD move if page table levels is 2
Aneesh Kumar K.V [Thu, 8 Jul 2021 01:10:09 +0000 (18:10 -0700)]
mm/mremap: don't enable optimized PUD move if page table levels is 2

With two level page table don't enable move_normal_pud.

Link: https://lkml.kernel.org/r/20210616045239.370802-5-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/mremap: convert huge PUD move to separate helper
Aneesh Kumar K.V [Thu, 8 Jul 2021 01:10:06 +0000 (18:10 -0700)]
mm/mremap: convert huge PUD move to separate helper

With TRANSPARENT_HUGEPAGE_PUD enabled the kernel can find huge PUD
entries.  Add a helper to move huge PUD entries on mremap().

This will be used by a later patch to optimize mremap of PUD_SIZE aligned
level 4 PTE mapped address

This also make sure we support mremap on huge PUD entries even with
CONFIG_HAVE_MOVE_PUD disabled.

[aneesh.kumar@linux.ibm.com: fix build failure with clang-10]
Link: https://lore.kernel.org/lkml/YMuOSnJsL9qkxweY@archlinux-ax161
Link: https://lkml.kernel.org/r/20210619134310.89098-1-aneesh.kumar@linux.ibm.com
Link: https://lkml.kernel.org/r/20210616045239.370802-4-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoselftest/mremap_test: avoid crash with static build
Aneesh Kumar K.V [Thu, 8 Jul 2021 01:10:03 +0000 (18:10 -0700)]
selftest/mremap_test: avoid crash with static build

With a large mmap map size, we can overlap with the text area and using
MAP_FIXED results in unmapping that area.  Switch to MAP_FIXED_NOREPLACE
and handle the EEXIST error.

Link: https://lkml.kernel.org/r/20210616045239.370802-3-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoselftest/mremap_test: update the test to handle pagesize other than 4K
Aneesh Kumar K.V [Thu, 8 Jul 2021 01:09:59 +0000 (18:09 -0700)]
selftest/mremap_test: update the test to handle pagesize other than 4K

Patch series "mrermap fixes", v2.

This patch (of 6):

Instead of hardcoding 4K page size fetch it using sysconf().  For the
performance measurements test still assume 2M and 1G are hugepage sizes.

Link: https://lkml.kernel.org/r/20210616045239.370802-1-aneesh.kumar@linux.ibm.com
Link: https://lkml.kernel.org/r/20210616045239.370802-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm: rename p4d_page_vaddr to p4d_pgtable and make it return pud_t *
Aneesh Kumar K.V [Thu, 8 Jul 2021 01:09:56 +0000 (18:09 -0700)]
mm: rename p4d_page_vaddr to p4d_pgtable and make it return pud_t *

No functional change in this patch.

[aneesh.kumar@linux.ibm.com: m68k build error reported by kernel robot]
Link: https://lkml.kernel.org/r/87tulxnb2v.fsf@linux.ibm.com
Link: https://lkml.kernel.org/r/20210615110859.320299-2-aneesh.kumar@linux.ibm.com
Link: https://lore.kernel.org/linuxppc-dev/CAHk-=wi+J+iodze9FtjM3Zi4j4OeS+qqbKxME9QN4roxPEXH9Q@mail.gmail.com/
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm: rename pud_page_vaddr to pud_pgtable and make it return pmd_t *
Aneesh Kumar K.V [Thu, 8 Jul 2021 01:09:53 +0000 (18:09 -0700)]
mm: rename pud_page_vaddr to pud_pgtable and make it return pmd_t *

No functional change in this patch.

[aneesh.kumar@linux.ibm.com: fix]
Link: https://lkml.kernel.org/r/87wnqtnb60.fsf@linux.ibm.com
[sfr@canb.auug.org.au: another fix]
Link: https://lkml.kernel.org/r/20210619134410.89559-1-aneesh.kumar@linux.ibm.com
Link: https://lkml.kernel.org/r/20210615110859.320299-1-aneesh.kumar@linux.ibm.com
Link: https://lore.kernel.org/linuxppc-dev/CAHk-=wi+J+iodze9FtjM3Zi4j4OeS+qqbKxME9QN4roxPEXH9Q@mail.gmail.com/
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agokdump: use vmlinux_build_id to simplify
Stephen Boyd [Thu, 8 Jul 2021 01:09:49 +0000 (18:09 -0700)]
kdump: use vmlinux_build_id to simplify

We can use the vmlinux_build_id array here now instead of open coding it.
This mostly consolidates code.

Link: https://lkml.kernel.org/r/20210511003845.2429846-14-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agobuildid: fix kernel-doc notation
Stephen Boyd [Thu, 8 Jul 2021 01:09:46 +0000 (18:09 -0700)]
buildid: fix kernel-doc notation

Kernel doc should use "Return:" instead of "Returns" to properly reflect
the return values.

Link: https://lkml.kernel.org/r/20210511003845.2429846-13-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agobuildid: mark some arguments const
Stephen Boyd [Thu, 8 Jul 2021 01:09:42 +0000 (18:09 -0700)]
buildid: mark some arguments const

These arguments are never modified so they can be marked const to indicate
as such.

Link: https://lkml.kernel.org/r/20210511003845.2429846-12-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoscripts/decode_stacktrace.sh: indicate 'auto' can be used for base path
Stephen Boyd [Thu, 8 Jul 2021 01:09:38 +0000 (18:09 -0700)]
scripts/decode_stacktrace.sh: indicate 'auto' can be used for base path

Add "auto" to the usage message so that it's a little clearer that you can
pass "auto" as the second argument.  When passing "auto" the script tries
to find the base path automatically instead of requiring it be passed on
the commandline.  Also use [<variable>] to indicate the variable argument
and that it is optional so that we can differentiate from the literal
"auto" that should be passed.

Link: https://lkml.kernel.org/r/20210511003845.2429846-11-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoscripts/decode_stacktrace.sh: silence stderr messages from addr2line/nm
Stephen Boyd [Thu, 8 Jul 2021 01:09:35 +0000 (18:09 -0700)]
scripts/decode_stacktrace.sh: silence stderr messages from addr2line/nm

Sometimes if you're using tools that have linked things improperly or have
new features/sections that older tools don't expect you'll see warnings
printed to stderr.  We don't really care about these warnings, so let's
just silence these messages to cleanup output of this script.

Link: https://lkml.kernel.org/r/20210511003845.2429846-10-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoscripts/decode_stacktrace.sh: support debuginfod
Stephen Boyd [Thu, 8 Jul 2021 01:09:31 +0000 (18:09 -0700)]
scripts/decode_stacktrace.sh: support debuginfod

Now that stacktraces contain the build ID information we can update this
script to use debuginfod-find to locate the debuginfo for the vmlinux and
modules automatically.  This can replace the existing code that requires
specifying a path to vmlinux or tries to find the vmlinux and modules
automatically by using the release number.  Work it into the script as a
fallback option if the vmlinux isn't specified on the commandline.

Link: https://lkml.kernel.org/r/20210511003845.2429846-9-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agox86/dumpstack: use %pSb/%pBb for backtrace printing
Stephen Boyd [Thu, 8 Jul 2021 01:09:27 +0000 (18:09 -0700)]
x86/dumpstack: use %pSb/%pBb for backtrace printing

Let's use the new printk formats to print the stacktrace entries when
printing a backtrace to the kernel logs.  This will include any module's
build ID[1] in it so that offline/crash debugging can easily locate the
debuginfo for a module via something like debuginfod[2].

Link: https://lkml.kernel.org/r/20210511003845.2429846-8-swboyd@chromium.org
Link: https://fedoraproject.org/wiki/Releases/FeatureBuildId
Link: https://sourceware.org/elfutils/Debuginfod.html
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoarm64: stacktrace: use %pSb for backtrace printing
Stephen Boyd [Thu, 8 Jul 2021 01:09:24 +0000 (18:09 -0700)]
arm64: stacktrace: use %pSb for backtrace printing

Let's use the new printk format to print the stacktrace entry when
printing a backtrace to the kernel logs. This will include any module's
build ID[1] in it so that offline/crash debugging can easily locate the
debuginfo for a module via something like debuginfod[2].

Link: https://lkml.kernel.org/r/20210511003845.2429846-7-swboyd@chromium.org
Link: https://fedoraproject.org/wiki/Releases/FeatureBuildId
Link: https://sourceware.org/elfutils/Debuginfod.html
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomodule: add printk formats to add module build ID to stacktraces
Stephen Boyd [Thu, 8 Jul 2021 01:09:20 +0000 (18:09 -0700)]
module: add printk formats to add module build ID to stacktraces

Let's make kernel stacktraces easier to identify by including the build
ID[1] of a module if the stacktrace is printing a symbol from a module.
This makes it simpler for developers to locate a kernel module's full
debuginfo for a particular stacktrace.  Combined with
scripts/decode_stracktrace.sh, a developer can download the matching
debuginfo from a debuginfod[2] server and find the exact file and line
number for the functions plus offsets in a stacktrace that match the
module.  This is especially useful for pstore crash debugging where the
kernel crashes are recorded in something like console-ramoops and the
recovery kernel/modules are different or the debuginfo doesn't exist on
the device due to space concerns (the debuginfo can be too large for space
limited devices).

Originally, I put this on the %pS format, but that was quickly rejected
given that %pS is used in other places such as ftrace where build IDs
aren't meaningful.  There was some discussions on the list to put every
module build ID into the "Modules linked in:" section of the stacktrace
message but that quickly becomes very hard to read once you have more than
three or four modules linked in.  It also provides too much information
when we don't expect each module to be traversed in a stacktrace.  Having
the build ID for modules that aren't important just makes things messy.
Splitting it to multiple lines for each module quickly explodes the number
of lines printed in an oops too, possibly wrapping the warning off the
console.  And finally, trying to stash away each module used in a
callstack to provide the ID of each symbol printed is cumbersome and would
require changes to each architecture to stash away modules and return
their build IDs once unwinding has completed.

Instead, we opt for the simpler approach of introducing new printk formats
'%pS[R]b' for "pointer symbolic backtrace with module build ID" and '%pBb'
for "pointer backtrace with module build ID" and then updating the few
places in the architecture layer where the stacktrace is printed to use
this new format.

Before:

 Call trace:
  lkdtm_WARNING+0x28/0x30 [lkdtm]
  direct_entry+0x16c/0x1b4 [lkdtm]
  full_proxy_write+0x74/0xa4
  vfs_write+0xec/0x2e8

After:

 Call trace:
  lkdtm_WARNING+0x28/0x30 [lkdtm 6c2215028606bda50de823490723dc4bc5bf46f9]
  direct_entry+0x16c/0x1b4 [lkdtm 6c2215028606bda50de823490723dc4bc5bf46f9]
  full_proxy_write+0x74/0xa4
  vfs_write+0xec/0x2e8

[akpm@linux-foundation.org: fix build with CONFIG_MODULES=n, tweak code layout]
[rdunlap@infradead.org: fix build when CONFIG_MODULES is not set]
Link: https://lkml.kernel.org/r/20210513171510.20328-1-rdunlap@infradead.org
[akpm@linux-foundation.org: make kallsyms_lookup_buildid() static]
[cuibixuan@huawei.com: fix build error when CONFIG_SYSFS is disabled]
Link: https://lkml.kernel.org/r/20210525105049.34804-1-cuibixuan@huawei.com
Link: https://lkml.kernel.org/r/20210511003845.2429846-6-swboyd@chromium.org
Link: https://fedoraproject.org/wiki/Releases/FeatureBuildId
Link: https://sourceware.org/elfutils/Debuginfod.html
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agodump_stack: add vmlinux build ID to stack traces
Stephen Boyd [Thu, 8 Jul 2021 01:09:17 +0000 (18:09 -0700)]
dump_stack: add vmlinux build ID to stack traces

Add the running kernel's build ID[1] to the stacktrace information header.
This makes it simpler for developers to locate the vmlinux with full
debuginfo for a particular kernel stacktrace.  Combined with
scripts/decode_stracktrace.sh, a developer can download the correct
vmlinux from a debuginfod[2] server and find the exact file and line
number for the functions plus offsets in a stacktrace.

This is especially useful for pstore crash debugging where the kernel
crashes are recorded in the pstore logs and the recovery kernel is
different or the debuginfo doesn't exist on the device due to space
concerns (the data can be large and a security concern).  The stacktrace
can be analyzed after the crash by using the build ID to find the matching
vmlinux and understand where in the function something went wrong.

Example stacktrace from lkdtm:

 WARNING: CPU: 4 PID: 3255 at drivers/misc/lkdtm/bugs.c:83 lkdtm_WARNING+0x28/0x30 [lkdtm]
 Modules linked in: lkdtm rfcomm algif_hash algif_skcipher af_alg xt_cgroup uinput xt_MASQUERADE
 CPU: 4 PID: 3255 Comm: bash Not tainted 5.11 #3 aa23f7a1231c229de205662d5a9e0d4c580f19a1
 Hardware name: Google Lazor (rev3+) with KB Backlight (DT)
 pstate: 00400009 (nzcv daif +PAN -UAO -TCO BTYPE=--)
 pc : lkdtm_WARNING+0x28/0x30 [lkdtm]

The hex string aa23f7a1231c229de205662d5a9e0d4c580f19a1 is the build ID,
following the kernel version number. Put it all behind a config option,
STACKTRACE_BUILD_ID, so that kernel developers can remove this
information if they decide it is too much.

Link: https://lkml.kernel.org/r/20210511003845.2429846-5-swboyd@chromium.org
Link: https://fedoraproject.org/wiki/Releases/FeatureBuildId
Link: https://sourceware.org/elfutils/Debuginfod.html
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agobuildid: stash away kernels build ID on init
Stephen Boyd [Thu, 8 Jul 2021 01:09:13 +0000 (18:09 -0700)]
buildid: stash away kernels build ID on init

Parse the kernel's build ID at initialization so that other code can print
a hex format string representation of the running kernel's build ID.  This
will be used in the kdump and dump_stack code so that developers can
easily locate the vmlinux debug symbols for a crash/stacktrace.

[swboyd@chromium.org: fix implicit declaration of init_vmlinux_build_id()]
Link: https://lkml.kernel.org/r/CAE-0n51UjTbay8N9FXAyE7_aR2+ePrQnKSRJ0gbmRsXtcLBVaw@mail.gmail.com
Link: https://lkml.kernel.org/r/20210511003845.2429846-4-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agobuildid: add API to parse build ID out of buffer
Stephen Boyd [Thu, 8 Jul 2021 01:09:10 +0000 (18:09 -0700)]
buildid: add API to parse build ID out of buffer

Add an API that can parse the build ID out of a buffer, instead of a vma,
to support printing a kernel module's build ID for stack traces.

Link: https://lkml.kernel.org/r/20210511003845.2429846-3-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agobuildid: only consider GNU notes for build ID parsing
Stephen Boyd [Thu, 8 Jul 2021 01:09:06 +0000 (18:09 -0700)]
buildid: only consider GNU notes for build ID parsing

Patch series "Add build ID to stacktraces", v6.

This series adds the kernel's build ID[1] to the stacktrace header printed
in oops messages, warnings, etc.  and the build ID for any module that
appears in the stacktrace after the module name.  The goal is to make the
stacktrace more self-contained and descriptive by including the relevant
build IDs in the kernel logs when something goes wrong.  This can be used
by post processing tools like script/decode_stacktrace.sh and kernel
developers to easily locate the debug info associated with a kernel crash
and line up what line and file things started falling apart at.

To show how this can be used I've included a patch to decode_stacktrace.sh
that downloads the debuginfo from a debuginfod server.  This also includes
some patches to make the buildid.c file use more const arguments and
consolidate logic into buildid.c from kdump.  These are left to the end as
they were mostly cleanup patches.

Here's an example lkdtm stacktrace on arm64.

 WARNING: CPU: 4 PID: 3255 at drivers/misc/lkdtm/bugs.c:83 lkdtm_WARNING+0x28/0x30 [lkdtm]
 Modules linked in: lkdtm rfcomm algif_hash algif_skcipher af_alg xt_cgroup uinput xt_MASQUERADE
 CPU: 4 PID: 3255 Comm: bash Not tainted 5.11 #3 aa23f7a1231c229de205662d5a9e0d4c580f19a1
 Hardware name: Google Lazor (rev3+) with KB Backlight (DT)
 pstate: 00400009 (nzcv daif +PAN -UAO -TCO BTYPE=--)
 pc : lkdtm_WARNING+0x28/0x30 [lkdtm]
 lr : lkdtm_do_action+0x24/0x40 [lkdtm]
 sp : ffffffc0134fbca0
 x29: ffffffc0134fbca0 x28: ffffff92d53ba240
 x27: 0000000000000000 x26: 0000000000000000
 x25: 0000000000000000 x24: ffffffe3622352c0
 x23: 0000000000000020 x22: ffffffe362233366
 x21: ffffffe3622352e0 x20: ffffffc0134fbde0
 x19: 0000000000000008 x18: 0000000000000000
 x17: ffffff929b6536fc x16: 0000000000000000
 x15: 0000000000000000 x14: 0000000000000012
 x13: ffffffe380ed892c x12: ffffffe381d05068
 x11: 0000000000000000 x10: 0000000000000000
 x9 : 0000000000000001 x8 : ffffffe362237000
 x7 : aaaaaaaaaaaaaaaa x6 : 0000000000000000
 x5 : 0000000000000000 x4 : 0000000000000001
 x3 : 0000000000000008 x2 : ffffff93fef25a70
 x1 : ffffff93fef15788 x0 : ffffffe3622352e0
 Call trace:
  lkdtm_WARNING+0x28/0x30 [lkdtm ed5019fdf5e53be37cb1ba7899292d7e143b259e]
  direct_entry+0x16c/0x1b4 [lkdtm ed5019fdf5e53be37cb1ba7899292d7e143b259e]
  full_proxy_write+0x74/0xa4
  vfs_write+0xec/0x2e8
  ksys_write+0x84/0xf0
  __arm64_sys_write+0x24/0x30
  el0_svc_common+0xf4/0x1c0
  do_el0_svc_compat+0x28/0x3c
  el0_svc_compat+0x10/0x1c
  el0_sync_compat_handler+0xa8/0xcc
  el0_sync_compat+0x178/0x180
 ---[ end trace 3d95032303e59e68 ]---

This patch (of 13):

Some kernel elf files have various notes that also happen to have an elf
note type of '3', which matches NT_GNU_BUILD_ID but the note name isn't
"GNU".  For example, this note trips up the existing logic:

 Owner  Data size   Description
 Xen    0x00000008  Unknown note type: (0x00000003) description data: 00 00 00 ffffff80 ffffffff ffffffff ffffffff ffffffff

Let's make sure that it is a GNU note when parsing the build ID so that we
can use this function to parse a vmlinux's build ID too.

Link: https://lkml.kernel.org/r/20210511003845.2429846-1-swboyd@chromium.org
Link: https://lkml.kernel.org/r/20210511003845.2429846-2-swboyd@chromium.org
Fixes: bd7525dacd7e ("bpf: Move stack_map_get_build_id into lib")
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Reported-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agox86: convert to setup_initial_init_mm()
Kefeng Wang [Thu, 8 Jul 2021 01:09:03 +0000 (18:09 -0700)]
x86: convert to setup_initial_init_mm()

Use setup_initial_init_mm() helper to simplify code.

Link: https://lkml.kernel.org/r/20210608083418.137226-16-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agosh: convert to setup_initial_init_mm()
Kefeng Wang [Thu, 8 Jul 2021 01:09:00 +0000 (18:09 -0700)]
sh: convert to setup_initial_init_mm()

Use setup_initial_init_mm() helper to simplify code.

Link: https://lkml.kernel.org/r/20210608083418.137226-15-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agos390: convert to setup_initial_init_mm()
Kefeng Wang [Thu, 8 Jul 2021 01:08:57 +0000 (18:08 -0700)]
s390: convert to setup_initial_init_mm()

Use setup_initial_init_mm() helper to simplify code.

Link: https://lkml.kernel.org/r/20210608083418.137226-14-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoriscv: convert to setup_initial_init_mm()
Kefeng Wang [Thu, 8 Jul 2021 01:08:54 +0000 (18:08 -0700)]
riscv: convert to setup_initial_init_mm()

Use setup_initial_init_mm() helper to simplify code.

Link: https://lkml.kernel.org/r/20210608083418.137226-13-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agopowerpc: convert to setup_initial_init_mm()
Kefeng Wang [Thu, 8 Jul 2021 01:08:51 +0000 (18:08 -0700)]
powerpc: convert to setup_initial_init_mm()

Use setup_initial_init_mm() helper to simplify code.

Link: https://lkml.kernel.org/r/20210608083418.137226-12-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoopenrisc: convert to setup_initial_init_mm()
Kefeng Wang [Thu, 8 Jul 2021 01:08:49 +0000 (18:08 -0700)]
openrisc: convert to setup_initial_init_mm()

Use setup_initial_init_mm() helper to simplify code.

Link: https://lkml.kernel.org/r/20210608083418.137226-11-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Stafford Horne <shorne@gmail.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agonios2: convert to setup_initial_init_mm()
Kefeng Wang [Thu, 8 Jul 2021 01:08:46 +0000 (18:08 -0700)]
nios2: convert to setup_initial_init_mm()

Use setup_initial_init_mm() helper to simplify code.

Link: https://lkml.kernel.org/r/20210608083418.137226-10-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agonds32: convert to setup_initial_init_mm()
Kefeng Wang [Thu, 8 Jul 2021 01:08:43 +0000 (18:08 -0700)]
nds32: convert to setup_initial_init_mm()

Use setup_initial_init_mm() helper to simplify code.

Link: https://lkml.kernel.org/r/20210608083418.137226-9-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Greentime Hu <green.hu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agom68k: convert to setup_initial_init_mm()
Kefeng Wang [Thu, 8 Jul 2021 01:08:40 +0000 (18:08 -0700)]
m68k: convert to setup_initial_init_mm()

Use setup_initial_init_mm() helper to simplify code.

Link: https://lkml.kernel.org/r/20210608083418.137226-8-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Greg Ungerer <gerg@linux-m68k.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoh8300: convert to setup_initial_init_mm()
Kefeng Wang [Thu, 8 Jul 2021 01:08:37 +0000 (18:08 -0700)]
h8300: convert to setup_initial_init_mm()

Use setup_initial_init_mm() helper to simplify code.

Link: https://lkml.kernel.org/r/20210608083418.137226-7-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agocsky: convert to setup_initial_init_mm()
Kefeng Wang [Thu, 8 Jul 2021 01:08:34 +0000 (18:08 -0700)]
csky: convert to setup_initial_init_mm()

Use setup_initial_init_mm() helper to simplify code.

Link: https://lkml.kernel.org/r/20210608083418.137226-6-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Guo Ren <guoren@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoarm64: convert to setup_initial_init_mm()
Kefeng Wang [Thu, 8 Jul 2021 01:08:31 +0000 (18:08 -0700)]
arm64: convert to setup_initial_init_mm()

Use setup_initial_init_mm() helper to simplify code.

Link: https://lkml.kernel.org/r/20210608083418.137226-5-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoarm: convert to setup_initial_init_mm()
Kefeng Wang [Thu, 8 Jul 2021 01:08:29 +0000 (18:08 -0700)]
arm: convert to setup_initial_init_mm()

Use setup_initial_init_mm() helper to simplify code.

Link: https://lkml.kernel.org/r/20210608083418.137226-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoarc: convert to setup_initial_init_mm()
Kefeng Wang [Thu, 8 Jul 2021 01:08:26 +0000 (18:08 -0700)]
arc: convert to setup_initial_init_mm()

Use setup_initial_init_mm() helper to simplify code.

Link: https://lkml.kernel.org/r/20210608083418.137226-3-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Vineet Gupta <vgupta@synopsys.com> arch/arc]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm: add setup_initial_init_mm() helper
Kefeng Wang [Thu, 8 Jul 2021 01:08:22 +0000 (18:08 -0700)]
mm: add setup_initial_init_mm() helper

Patch series "init_mm: cleanup ARCH's text/data/brk setup code", v3.

Add setup_initial_init_mm() helper, then use it to cleanup the text, data
and brk setup code.

This patch (of 15):

Add setup_initial_init_mm() helper to setup kernel text, data and brk.

Link: https://lkml.kernel.org/r/20210608083418.137226-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20210608083418.137226-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm: fix spelling mistakes in header files
Zhen Lei [Thu, 8 Jul 2021 01:08:19 +0000 (18:08 -0700)]
mm: fix spelling mistakes in header files

Fix some spelling mistakes in comments:
successfull ==> successful
potentialy ==> potentially
alloced ==> allocated
indicies ==> indices
wont ==> won't
resposible ==> responsible
dirtyness ==> dirtiness
droppped ==> dropped
alread ==> already
occured ==> occurred
interupts ==> interrupts
extention ==> extension
slighly ==> slightly
Dont't ==> Don't

Link: https://lkml.kernel.org/r/20210531034849.9549-2-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agosecretmem: test: add basic selftest for memfd_secret(2)
Mike Rapoport [Thu, 8 Jul 2021 01:08:15 +0000 (18:08 -0700)]
secretmem: test: add basic selftest for memfd_secret(2)

The test verifies that file descriptor created with memfd_secret does not
allow read/write operations, that secret memory mappings respect
RLIMIT_MEMLOCK and that remote accesses with process_vm_read() and
ptrace() to the secret memory fail.

Link: https://lkml.kernel.org/r/20210518072034.31572-8-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoarch, mm: wire up memfd_secret system call where relevant
Mike Rapoport [Thu, 8 Jul 2021 01:08:11 +0000 (18:08 -0700)]
arch, mm: wire up memfd_secret system call where relevant

Wire up memfd_secret system call on architectures that define
ARCH_HAS_SET_DIRECT_MAP, namely arm64, risc-v and x86.

Link: https://lkml.kernel.org/r/20210518072034.31572-7-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoPM: hibernate: disable when there are active secretmem users
Mike Rapoport [Thu, 8 Jul 2021 01:08:07 +0000 (18:08 -0700)]
PM: hibernate: disable when there are active secretmem users

It is unsafe to allow saving of secretmem areas to the hibernation
snapshot as they would be visible after the resume and this essentially
will defeat the purpose of secret memory mappings.

Prevent hibernation whenever there are active secret memory users.

Link: https://lkml.kernel.org/r/20210518072034.31572-6-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm: introduce memfd_secret system call to create "secret" memory areas
Mike Rapoport [Thu, 8 Jul 2021 01:08:03 +0000 (18:08 -0700)]
mm: introduce memfd_secret system call to create "secret" memory areas

Introduce "memfd_secret" system call with the ability to create memory
areas visible only in the context of the owning process and not mapped not
only to other processes but in the kernel page tables as well.

The secretmem feature is off by default and the user must explicitly
enable it at the boot time.

Once secretmem is enabled, the user will be able to create a file
descriptor using the memfd_secret() system call.  The memory areas created
by mmap() calls from this file descriptor will be unmapped from the kernel
direct map and they will be only mapped in the page table of the processes
that have access to the file descriptor.

Secretmem is designed to provide the following protections:

* Enhanced protection (in conjunction with all the other in-kernel
  attack prevention systems) against ROP attacks.  Seceretmem makes
  "simple" ROP insufficient to perform exfiltration, which increases the
  required complexity of the attack.  Along with other protections like
  the kernel stack size limit and address space layout randomization which
  make finding gadgets is really hard, absence of any in-kernel primitive
  for accessing secret memory means the one gadget ROP attack can't work.
  Since the only way to access secret memory is to reconstruct the missing
  mapping entry, the attacker has to recover the physical page and insert
  a PTE pointing to it in the kernel and then retrieve the contents.  That
  takes at least three gadgets which is a level of difficulty beyond most
  standard attacks.

* Prevent cross-process secret userspace memory exposures.  Once the
  secret memory is allocated, the user can't accidentally pass it into the
  kernel to be transmitted somewhere.  The secreremem pages cannot be
  accessed via the direct map and they are disallowed in GUP.

* Harden against exploited kernel flaws.  In order to access secretmem,
  a kernel-side attack would need to either walk the page tables and
  create new ones, or spawn a new privileged uiserspace process to perform
  secrets exfiltration using ptrace.

The file descriptor based memory has several advantages over the
"traditional" mm interfaces, such as mlock(), mprotect(), madvise().  File
descriptor approach allows explicit and controlled sharing of the memory
areas, it allows to seal the operations.  Besides, file descriptor based
memory paves the way for VMMs to remove the secret memory range from the
userspace hipervisor process, for instance QEMU.  Andy Lutomirski says:

  "Getting fd-backed memory into a guest will take some possibly major
  work in the kernel, but getting vma-backed memory into a guest without
  mapping it in the host user address space seems much, much worse."

memfd_secret() is made a dedicated system call rather than an extension to
memfd_create() because it's purpose is to allow the user to create more
secure memory mappings rather than to simply allow file based access to
the memory.  Nowadays a new system call cost is negligible while it is way
simpler for userspace to deal with a clear-cut system calls than with a
multiplexer or an overloaded syscall.  Moreover, the initial
implementation of memfd_secret() is completely distinct from
memfd_create() so there is no much sense in overloading memfd_create() to
begin with.  If there will be a need for code sharing between these
implementation it can be easily achieved without a need to adjust user
visible APIs.

The secret memory remains accessible in the process context using uaccess
primitives, but it is not exposed to the kernel otherwise; secret memory
areas are removed from the direct map and functions in the
follow_page()/get_user_page() family will refuse to return a page that
belongs to the secret memory area.

Once there will be a use case that will require exposing secretmem to the
kernel it will be an opt-in request in the system call flags so that user
would have to decide what data can be exposed to the kernel.

Removing of the pages from the direct map may cause its fragmentation on
architectures that use large pages to map the physical memory which
affects the system performance.  However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "...  can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice".  Hence, it is sufficient to
have secretmem disabled by default with the ability of a system
administrator to enable it at boot time.

Pages in the secretmem regions are unevictable and unmovable to avoid
accidental exposure of the sensitive data via swap or during page
migration.

Since the secretmem mappings are locked in memory they cannot exceed
RLIMIT_MEMLOCK.  Since these mappings are already locked independently
from mlock(), an attempt to mlock()/munlock() secretmem range would fail
and mlockall()/munlockall() will ignore secretmem mappings.

However, unlike mlock()ed memory, secretmem currently behaves more like
long-term GUP: secretmem mappings are unmovable mappings directly consumed
by user space.  With default limits, there is no excessive use of
secretmem and it poses no real problem in combination with
ZONE_MOVABLE/CMA, but in the future this should be addressed to allow
balanced use of large amounts of secretmem along with ZONE_MOVABLE/CMA.

A page that was a part of the secret memory area is cleared when it is
freed to ensure the data is not exposed to the next user of that page.

The following example demonstrates creation of a secret mapping (error
handling is omitted):

fd = memfd_secret(0);
ftruncate(fd, MAP_SIZE);
ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
   MAP_SHARED, fd, 0);

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

[akpm@linux-foundation.org: suppress Kconfig whine]

Link: https://lkml.kernel.org/r/20210518072034.31572-5-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoset_memory: allow querying whether set_direct_map_*() is actually enabled
Mike Rapoport [Thu, 8 Jul 2021 01:07:59 +0000 (18:07 -0700)]
set_memory: allow querying whether set_direct_map_*() is actually enabled

On arm64, set_direct_map_*() functions may return 0 without actually
changing the linear map.  This behaviour can be controlled using kernel
parameters, so we need a way to determine at runtime whether calls to
set_direct_map_invalid_noflush() and set_direct_map_default_noflush() have
any effect.

Extend set_memory API with can_set_direct_map() function that allows
checking if calling set_direct_map_*() will actually change the page
table, replace several occurrences of open coded checks in arm64 with the
new function and provide a generic stub for architectures that always
modify page tables upon calls to set_direct_map APIs.

[arnd@arndb.de: arm64: kfence: fix header inclusion ]

Link: https://lkml.kernel.org/r/20210518072034.31572-4-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoriscv/Kconfig: make direct map manipulation options depend on MMU
Mike Rapoport [Thu, 8 Jul 2021 01:07:54 +0000 (18:07 -0700)]
riscv/Kconfig: make direct map manipulation options depend on MMU

ARCH_HAS_SET_DIRECT_MAP and ARCH_HAS_SET_MEMORY configuration options have
no meaning when CONFIG_MMU is disabled and there is no point to enable
them for the nommu case.

Add an explicit dependency on MMU for these options.

Link: https://lkml.kernel.org/r/20210518072034.31572-3-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agommap: make mlock_future_check() global
Mike Rapoport [Thu, 8 Jul 2021 01:07:50 +0000 (18:07 -0700)]
mmap: make mlock_future_check() global

Patch series "mm: introduce memfd_secret system call to create "secret" memory areas", v20.

This is an implementation of "secret" mappings backed by a file
descriptor.

The file descriptor backing secret memory mappings is created using a
dedicated memfd_secret system call The desired protection mode for the
memory is configured using flags parameter of the system call.  The mmap()
of the file descriptor created with memfd_secret() will create a "secret"
memory mapping.  The pages in that mapping will be marked as not present
in the direct map and will be present only in the page table of the owning
mm.

Although normally Linux userspace mappings are protected from other users,
such secret mappings are useful for environments where a hostile tenant is
trying to trick the kernel into giving them access to other tenants
mappings.

It's designed to provide the following protections:

* Enhanced protection (in conjunction with all the other in-kernel
  attack prevention systems) against ROP attacks.  Seceretmem makes
  "simple" ROP insufficient to perform exfiltration, which increases the
  required complexity of the attack.  Along with other protections like
  the kernel stack size limit and address space layout randomization which
  make finding gadgets is really hard, absence of any in-kernel primitive
  for accessing secret memory means the one gadget ROP attack can't work.
  Since the only way to access secret memory is to reconstruct the missing
  mapping entry, the attacker has to recover the physical page and insert
  a PTE pointing to it in the kernel and then retrieve the contents.  That
  takes at least three gadgets which is a level of difficulty beyond most
  standard attacks.

* Prevent cross-process secret userspace memory exposures.  Once the
  secret memory is allocated, the user can't accidentally pass it into the
  kernel to be transmitted somewhere.  The secreremem pages cannot be
  accessed via the direct map and they are disallowed in GUP.

* Harden against exploited kernel flaws.  In order to access secretmem,
  a kernel-side attack would need to either walk the page tables and
  create new ones, or spawn a new privileged uiserspace process to perform
  secrets exfiltration using ptrace.

In the future the secret mappings may be used as a mean to protect guest
memory in a virtual machine host.

For demonstration of secret memory usage we've created a userspace library

https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloader.git

that does two things: the first is act as a preloader for openssl to
redirect all the OPENSSL_malloc calls to secret memory meaning any secret
keys get automatically protected this way and the other thing it does is
expose the API to the user who needs it.  We anticipate that a lot of the
use cases would be like the openssl one: many toolkits that deal with
secret keys already have special handling for the memory to try to give
them greater protection, so this would simply be pluggable into the
toolkits without any need for user application modification.

Hiding secret memory mappings behind an anonymous file allows usage of the
page cache for tracking pages allocated for the "secret" mappings as well
as using address_space_operations for e.g.  page migration callbacks.

The anonymous file may be also used implicitly, like hugetlb files, to
implement mmap(MAP_SECRET) and use the secret memory areas with "native"
mm ABIs in the future.

Removing of the pages from the direct map may cause its fragmentation on
architectures that use large pages to map the physical memory which
affects the system performance.  However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "...  can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice".  Hence, it is sufficient to
have secretmem disabled by default with the ability of a system
administrator to enable it at boot time.

In addition, there is also a long term goal to improve management of the
direct map.

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

This patch (of 7):

It will be used by the upcoming secret memory implementation.

Link: https://lkml.kernel.org/r/20210518072034.31572-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20210518072034.31572-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/slub: use stackdepot to save stack trace in objects
Oliver Glitta [Thu, 8 Jul 2021 01:07:47 +0000 (18:07 -0700)]
mm/slub: use stackdepot to save stack trace in objects

Many stack traces are similar so there are many similar arrays.
Stackdepot saves each unique stack only once.

Replace field addrs in struct track with depot_stack_handle_t handle.  Use
stackdepot to save stack trace.

The benefits are smaller memory overhead and possibility to aggregate
per-cache statistics in the future using the stackdepot handle instead of
matching stacks manually.

[rdunlap@infradead.org: rename save_stack_trace()]
Link: https://lkml.kernel.org/r/20210513051920.29320-1-rdunlap@infradead.org
[vbabka@suse.cz: fix lockdep splat]
Link: https://lkml.kernel.org/r/20210516195150.26740-1-vbabka@suse.czLink:
Signed-off-by: Oliver Glitta <glittao@gmail.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agohexagon: select ARCH_WANT_LD_ORPHAN_WARN
Nathan Chancellor [Thu, 8 Jul 2021 01:07:44 +0000 (18:07 -0700)]
hexagon: select ARCH_WANT_LD_ORPHAN_WARN

Now that we handle all of the sections in a Hexagon defconfig, select
ARCH_WANT_LD_ORPHAN_WARN so that unhandled sections are warned about by
default.

Link: https://lkml.kernel.org/r/20210521011239.1332345-4-nathan@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Brian Cain <bcain@codeaurora.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Oliver Glitta <glittao@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agohexagon: use common DISCARDS macro
Nathan Chancellor [Thu, 8 Jul 2021 01:07:41 +0000 (18:07 -0700)]
hexagon: use common DISCARDS macro

ld.lld warns that the '.modinfo' section is not currently handled:

ld.lld: warning: kernel/built-in.a(workqueue.o):(.modinfo) is being placed in '.modinfo'
ld.lld: warning: kernel/built-in.a(printk/printk.o):(.modinfo) is being placed in '.modinfo'
ld.lld: warning: kernel/built-in.a(irq/spurious.o):(.modinfo) is being placed in '.modinfo'
ld.lld: warning: kernel/built-in.a(rcu/update.o):(.modinfo) is being placed in '.modinfo'

The '.modinfo' section was added in commit 898490c010b5 ("moduleparam:
Save information about built-in modules in separate file") to the DISCARDS
macro but Hexagon has never used that macro.  The unification of DISCARDS
happened in commit 023bf6f1b8bf ("linker script: unify usage of discard
definition") in 2009, prior to Hexagon being added in 2011.

Switch Hexagon over to the DISCARDS macro so that anything that is
expected to be discarded gets discarded.

Link: https://lkml.kernel.org/r/20210521011239.1332345-3-nathan@kernel.org
Fixes: e95bf452a9e2 ("Hexagon: Add configuration and makefiles for the Hexagon architecture.")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Brian Cain <bcain@codeaurora.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Oliver Glitta <glittao@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agohexagon: handle {,SOFT}IRQENTRY_TEXT in linker script
Nathan Chancellor [Thu, 8 Jul 2021 01:07:38 +0000 (18:07 -0700)]
hexagon: handle {,SOFT}IRQENTRY_TEXT in linker script

Patch series "hexagon: Fix build error with CONFIG_STACKDEPOT and select CONFIG_ARCH_WANT_LD_ORPHAN_WARN".

This series fixes an error with ARCH=hexagon that was pointed out by the
patch "mm/slub: use stackdepot to save stack trace in objects".

The first patch fixes that error by handling the '.irqentry.text' and
'.softirqentry.text' sections.

The second patch switches Hexagon over to the common DISCARDS macro, which
should have been done when Hexagon was merged into the tree to match
commit 023bf6f1b8bf ("linker script: unify usage of discard definition").

The third patch selects CONFIG_ARCH_WANT_LD_ORPHAN_WARN so that something
like this does not happen again.

This patch (of 3):

Patch "mm/slub: use stackdepot to save stack trace in objects" in -mm
selects CONFIG_STACKDEPOT when CONFIG_STACKTRACE_SUPPORT is selected and
CONFIG_STACKDEPOT requires IRQENTRY_TEXT and SOFTIRQENTRY_TEXT to be
handled after commit 505a0ef15f96 ("kasan: stackdepot: move
filter_irq_stacks() to stackdepot.c") due to the use of the
__{,soft}irqentry_text_{start,end} section symbols.  If those sections are
not handled, the build is broken.

$ make ARCH=hexagon CROSS_COMPILE=hexagon-linux- LLVM=1 LLVM_IAS=1 defconfig all
...
ld.lld: error: undefined symbol: __irqentry_text_start
>>> referenced by stackdepot.c
>>>               stackdepot.o:(filter_irq_stacks) in archive lib/built-in.a
>>> referenced by stackdepot.c
>>>               stackdepot.o:(filter_irq_stacks) in archive lib/built-in.a

ld.lld: error: undefined symbol: __irqentry_text_end
>>> referenced by stackdepot.c
>>>               stackdepot.o:(filter_irq_stacks) in archive lib/built-in.a
>>> referenced by stackdepot.c
>>>               stackdepot.o:(filter_irq_stacks) in archive lib/built-in.a

ld.lld: error: undefined symbol: __softirqentry_text_start
>>> referenced by stackdepot.c
>>>               stackdepot.o:(filter_irq_stacks) in archive lib/built-in.a
>>> referenced by stackdepot.c
>>>               stackdepot.o:(filter_irq_stacks) in archive lib/built-in.a

ld.lld: error: undefined symbol: __softirqentry_text_end
>>> referenced by stackdepot.c
>>>               stackdepot.o:(filter_irq_stacks) in archive lib/built-in.a
>>> referenced by stackdepot.c
>>>               stackdepot.o:(filter_irq_stacks) in archive lib/built-in.a
...

Add these sections to the Hexagon linker script so the build continues to
work.  ld.lld's orphan section warning would have caught this prior to the
-mm commit mentioned above:

ld.lld: warning: kernel/built-in.a(softirq.o):(.softirqentry.text) is being placed in '.softirqentry.text'
ld.lld: warning: kernel/built-in.a(softirq.o):(.softirqentry.text) is being placed in '.softirqentry.text'
ld.lld: warning: kernel/built-in.a(softirq.o):(.softirqentry.text) is being placed in '.softirqentry.text'

Link: https://lkml.kernel.org/r/20210521011239.1332345-1-nathan@kernel.org
Link: https://lkml.kernel.org/r/20210521011239.1332345-2-nathan@kernel.org
Link: https://github.com/ClangBuiltLinux/linux/issues/1381
Fixes: 505a0ef15f96 ("kasan: stackdepot: move filter_irq_stacks() to stackdepot.c")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Brian Cain <bcain@codeaurora.org>
Cc: Oliver Glitta <glittao@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agolib: fix spelling mistakes in header files
Zhen Lei [Thu, 8 Jul 2021 01:07:34 +0000 (18:07 -0700)]
lib: fix spelling mistakes in header files

Fix some spelling mistakes in comments found by "codespell":
Hoever ==> However
poiter ==> pointer
representaion ==> representation
uppon ==> upon
independend ==> independent
aquired ==> acquired
mis-match ==> mismatch
scrach ==> scratch
struture ==> structure
Analagous ==> Analogous
interation ==> iteration

And some were discovered manually by Joe Perches and Christoph Lameter:
stroed ==> stored
arch independent ==> an architecture independent
A example structure for ==> Example structure for

Link: https://lkml.kernel.org/r/20210609150027.14805-2-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Christoph Lameter <cl@gentwo.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agolib: fix spelling mistakes
Zhen Lei [Thu, 8 Jul 2021 01:07:31 +0000 (18:07 -0700)]
lib: fix spelling mistakes

Fix some spelling mistakes in comments:
permanentely ==> permanently
wont ==> won't
remaning ==> remaining
succed ==> succeed
shouldnt ==> shouldn't
alpha-numeric ==> alphanumeric
storeing ==> storing
funtion ==> function
documenation ==> documentation
Determin ==> Determine
intepreted ==> interpreted
ammount ==> amount
obious ==> obvious
interupts ==> interrupts
occured ==> occurred
asssociated ==> associated
taking into acount ==> taking into account
squence ==> sequence
stil ==> still
contiguos ==> contiguous
matchs ==> matches

Link: https://lkml.kernel.org/r/20210607072555.12416-1-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agolib/test: fix spelling mistakes
Zhen Lei [Thu, 8 Jul 2021 01:07:28 +0000 (18:07 -0700)]
lib/test: fix spelling mistakes

Fix some spelling mistakes in comments found by "codespell":
thats ==> that's
unitialized ==> uninitialized
panicing ==> panicking
sucess ==> success
possitive ==> positive
intepreted ==> interpreted

Link: https://lkml.kernel.org/r/20210607133036.12525-2-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com> [test_bfp.c]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoFix UCOUNT_RLIMIT_SIGPENDING counter leak
Alexey Gladkov [Thu, 8 Jul 2021 10:33:01 +0000 (12:33 +0200)]
Fix UCOUNT_RLIMIT_SIGPENDING counter leak

We must properly handle an errors when we increase the rlimit counter
and the ucounts reference counter. We have to this with RCU protection
to prevent possible use-after-free that could occur due to concurrent
put_cred_rcu().

The following reproducer triggers the problem:

  $ cat testcase.sh
  case "${STEP:-0}" in
  0)
ulimit -Si 1
ulimit -Hi 1
STEP=1 unshare -rU "$0"
killall sleep
;;
  1)
for i in 1 2 3 4 5; do unshare -rU sleep 5 & done
;;
  esac

with the KASAN report being along the lines of

  BUG: KASAN: use-after-free in put_ucounts+0x17/0xa0
  Write of size 4 at addr ffff8880045f031c by task swapper/2/0

  CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.13.0+ #19
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-alt4 04/01/2014
  Call Trace:
   <IRQ>
   put_ucounts+0x17/0xa0
   put_cred_rcu+0xd5/0x190
   rcu_core+0x3bf/0xcb0
   __do_softirq+0xe3/0x341
   irq_exit_rcu+0xbe/0xe0
   sysvec_apic_timer_interrupt+0x6a/0x90
   </IRQ>
   asm_sysvec_apic_timer_interrupt+0x12/0x20
   default_idle_call+0x53/0x130
   do_idle+0x311/0x3c0
   cpu_startup_entry+0x14/0x20
   secondary_startup_64_no_verify+0xc2/0xcb

  Allocated by task 127:
   kasan_save_stack+0x1b/0x40
   __kasan_kmalloc+0x7c/0x90
   alloc_ucounts+0x169/0x2b0
   set_cred_ucounts+0xbb/0x170
   ksys_unshare+0x24c/0x4e0
   __x64_sys_unshare+0x16/0x20
   do_syscall_64+0x37/0x70
   entry_SYSCALL_64_after_hwframe+0x44/0xae

  Freed by task 0:
   kasan_save_stack+0x1b/0x40
   kasan_set_track+0x1c/0x30
   kasan_set_free_info+0x20/0x30
   __kasan_slab_free+0xeb/0x120
   kfree+0xaa/0x460
   put_cred_rcu+0xd5/0x190
   rcu_core+0x3bf/0xcb0
   __do_softirq+0xe3/0x341

  The buggy address belongs to the object at ffff8880045f0300
   which belongs to the cache kmalloc-192 of size 192
  The buggy address is located 28 bytes inside of
   192-byte region [ffff8880045f0300ffff8880045f03c0)
  The buggy address belongs to the page:
  page:000000008de0a388 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff8880045f0000 pfn:0x45f0
  flags: 0x100000000000200(slab|node=0|zone=1)
  raw: 0100000000000200 ffffea00000f4640 0000000a0000000a ffff888001042a00
  raw: ffff8880045f0000 000000008010000d 00000001ffffffff 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffff8880045f0200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff8880045f0280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
  >ffff8880045f0300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                              ^
   ffff8880045f0380: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
   ffff8880045f0400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ==================================================================
  Disabling lock debugging due to kernel taint

Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoMerge part 2 of branch 'sysfs-devel'
Trond Myklebust [Sat, 26 Jun 2021 16:20:23 +0000 (12:20 -0400)]
Merge part 2 of branch 'sysfs-devel'

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agoNFSv4/pNFS: Return an error if _nfs4_pnfs_v3_ds_connect can't load NFSv3
Trond Myklebust [Sat, 3 Jul 2021 18:31:49 +0000 (14:31 -0400)]
NFSv4/pNFS: Return an error if _nfs4_pnfs_v3_ds_connect can't load NFSv3

Currently we fail to return an error if the NFSv3 module failed to load
when we're trying to connect to a pNFS data server.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agoNFSv4/pNFS: Don't call _nfs4_pnfs_v3_ds_connect multiple times
Trond Myklebust [Sat, 3 Jul 2021 18:34:20 +0000 (14:34 -0400)]
NFSv4/pNFS: Don't call _nfs4_pnfs_v3_ds_connect multiple times

After we grab the lock in nfs4_pnfs_ds_connect(), there is no check for
whether or not ds->ds_clp has already been initialised, so we can end up
adding the same transports multiple times.

Fixes: fc821d59209d ("pnfs/NFSv4.1: Add multipath capabilities to pNFS flexfiles servers over NFSv3")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agoNFSv4/pnfs: Clean up layout get on open
Trond Myklebust [Fri, 2 Jul 2021 21:24:22 +0000 (17:24 -0400)]
NFSv4/pnfs: Clean up layout get on open

Cache the layout in the arguments so we don't have to keep looking it up
from the inode.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agoNFSv4/pnfs: Fix layoutget behaviour after invalidation
Trond Myklebust [Fri, 2 Jul 2021 23:48:41 +0000 (19:48 -0400)]
NFSv4/pnfs: Fix layoutget behaviour after invalidation

If the layout gets invalidated, we should wait for any outstanding
layoutget requests for that layout to complete, and we should resend
them only after re-establishing the layout stateid.

Fixes: d29b468da4f9 ("pNFS/NFSv4: Improve rejection of out-of-order layouts")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agoNFSv4/pnfs: Fix the layout barrier update
Trond Myklebust [Fri, 2 Jul 2021 20:37:15 +0000 (16:37 -0400)]
NFSv4/pnfs: Fix the layout barrier update

If we have multiple outstanding layoutget requests, the current code to
update the layout barrier assumes that the outstanding layout stateids
are updated in order. That's not necessarily the case.

Instead of using the value of lo->plh_outstanding as a guesstimate for
the window of values we need to accept, just wait to update the window
until we're processing the last one. The intention here is just to
ensure that we don't process 2^31 seqid updates without also updating
the barrier.

Fixes: 1bcf34fdac5f ("pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agoNFS: Fix fscache read from NFS after cache error
Dave Wysochanski [Tue, 29 Jun 2021 17:13:57 +0000 (13:13 -0400)]
NFS: Fix fscache read from NFS after cache error

Earlier commits refactored some NFS read code and removed
nfs_readpage_async(), but neglected to properly fixup
nfs_readpage_from_fscache_complete().  The code path is
only hit when something unusual occurs with the cachefiles
backing filesystem, such as an IO error or while a cookie
is being invalidated.

Mark page with PG_checked if fscache IO completes in error,
unlock the page, and let the VM decide to re-issue based on
PG_uptodate.  When the VM reissues the readpage, PG_checked
allows us to skip over fscache and read from the server.

Link: https://marc.info/?l=linux-nfs&m=162498209518739
Fixes: 1e83b173b266 ("NFS: Add nfs_pageio_complete_read() and remove nfs_readpage_async()")
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agoNFS: Ensure nfs_readpage returns promptly when internal error occurs
Dave Wysochanski [Tue, 29 Jun 2021 09:11:28 +0000 (05:11 -0400)]
NFS: Ensure nfs_readpage returns promptly when internal error occurs

A previous refactoring of nfs_readpage() might end up calling
wait_on_page_locked_killable() even if readpage_async_filler() failed
with an internal error and pg_error was non-zero (for example, if
nfs_create_request() failed).  In the case of an internal error,
skip over wait_on_page_locked_killable() as this is only needed
when the read is sent and an error occurs during completion handling.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agoMerge branch 'sysfs-devel'
Trond Myklebust [Mon, 21 Jun 2021 15:39:45 +0000 (11:39 -0400)]
Merge branch 'sysfs-devel'

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agosunrpc: remove an offlined xprt using sysfs
Olga Kornievskaia [Thu, 24 Jun 2021 03:28:53 +0000 (23:28 -0400)]
sunrpc: remove an offlined xprt using sysfs

Once a transport has been put offline, this transport can be also
removed from the list of transports. Any tasks that have been stuck
on this transport would find the next available active transport
and be re-tried. This transport would be removed from the xprt_switch
list and freed.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agosunrpc: provide showing transport's state info in the sysfs directory
Olga Kornievskaia [Tue, 8 Jun 2021 19:59:22 +0000 (15:59 -0400)]
sunrpc: provide showing transport's state info in the sysfs directory

In preparation of being able to change the xprt's state, add a way
to show currect state of the transport.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agosunrpc: display xprt's queuelen of assigned tasks via sysfs
Olga Kornievskaia [Thu, 24 Jun 2021 03:28:52 +0000 (23:28 -0400)]
sunrpc: display xprt's queuelen of assigned tasks via sysfs

Once a task grabs a trasnport it's reflected in the queuelen of
the rpc_xprt structure. Add display of that value in the xprt's
info file in sysfs.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agosunrpc: provide multipath info in the sysfs directory
Olga Kornievskaia [Tue, 8 Jun 2021 19:59:21 +0000 (15:59 -0400)]
sunrpc: provide multipath info in the sysfs directory

Allow to query xrpt_switch attributes. Currently showing the following
fields of the rpc_xprt_switch structure: xps_nxprts, xps_nactive,
xps_queuelen.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agoNFSv4.1 identify and mark RPC tasks that can move between transports
Olga Kornievskaia [Thu, 24 Jun 2021 03:28:51 +0000 (23:28 -0400)]
NFSv4.1 identify and mark RPC tasks that can move between transports

In preparation for when we can re-try a task on a different transport,
identify and mark such RPC tasks as moveable. Only 4.1+ operarations can
be re-tried on a different transport.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agosunrpc: provide transport info in the sysfs directory
Olga Kornievskaia [Tue, 8 Jun 2021 19:59:20 +0000 (15:59 -0400)]
sunrpc: provide transport info in the sysfs directory

Allow to query transport's attributes. Currently showing following
fields of the rpc_xprt structure: state, last_used, cong, cwnd,
max_reqs, min_reqs, num_reqs, sizes of queues binding, sending,
pending, backlog.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agoSUNRPC: take a xprt offline using sysfs
Olga Kornievskaia [Thu, 24 Jun 2021 03:28:50 +0000 (23:28 -0400)]
SUNRPC: take a xprt offline using sysfs

Using sysfs's xprt_state attribute, mark a particular transport offline.
It will not be picked during the round-robin selection. It's not allowed
to take the main (1st created transport associated with the rpc_client)
offline. Also bring a transport back online via sysfs by writing "online"
and that would allow for this transport to be picked during the round-
robin selection.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agosunrpc: add dst_attr attributes to the sysfs xprt directory
Olga Kornievskaia [Tue, 8 Jun 2021 19:59:19 +0000 (15:59 -0400)]
sunrpc: add dst_attr attributes to the sysfs xprt directory

Allow to query and set the destination's address of a transport.
Setting of the destination address is allowed only for TCP or RDMA
based connections.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agoSUNRPC for TCP display xprt's source port in sysfs xprt_info
Olga Kornievskaia [Thu, 24 Jun 2021 03:28:49 +0000 (23:28 -0400)]
SUNRPC for TCP display xprt's source port in sysfs xprt_info

Using TCP connection's source port it is useful to match connections
seen on the network traces to the xprts used by the linux nfs client.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agoSUNRPC query transport's source port
Olga Kornievskaia [Thu, 24 Jun 2021 03:28:48 +0000 (23:28 -0400)]
SUNRPC query transport's source port

Provide ability to query transport's source port.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agoSUNRPC display xprt's main value in sysfs's xprt_info
Olga Kornievskaia [Thu, 24 Jun 2021 03:28:47 +0000 (23:28 -0400)]
SUNRPC display xprt's main value in sysfs's xprt_info

Display in sysfs in the information about the xprt if this is a
main transport or not.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agoSUNRPC mark the first transport
Olga Kornievskaia [Thu, 24 Jun 2021 03:28:46 +0000 (23:28 -0400)]
SUNRPC mark the first transport

When an RPC client gets created it's first transport is special
and should be marked a main transport.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agosunrpc: add add sysfs directory per xprt under each xprt_switch
Olga Kornievskaia [Tue, 8 Jun 2021 19:59:18 +0000 (15:59 -0400)]
sunrpc: add add sysfs directory per xprt under each xprt_switch

Add individual transport directories under each transport switch
group. For instance, for each nconnect=X connections there will be
a transport directory. Naming conventions also identifies transport
type -- xprt-<id>-<type> where type is udp, tcp, rdma, local, bc.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agosunrpc: add a symlink from rpc-client directory to the xprt_switch
Olga Kornievskaia [Tue, 8 Jun 2021 19:59:17 +0000 (15:59 -0400)]
sunrpc: add a symlink from rpc-client directory to the xprt_switch

An rpc client uses a transport switch and one ore more transports
associated with that switch. Since transports are shared among
rpc clients, create a symlink into the xprt_switch directory
instead of duplicating entries under each rpc client.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agosunrpc: add xprt_switch direcotry to sunrpc's sysfs
Olga Kornievskaia [Tue, 8 Jun 2021 19:59:16 +0000 (15:59 -0400)]
sunrpc: add xprt_switch direcotry to sunrpc's sysfs

Add xprt_switch directory to the sysfs and create individual
xprt_swith subdirectories for multipath transport group.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agosunrpc: keep track of the xprt_class in rpc_xprt structure
Olga Kornievskaia [Tue, 8 Jun 2021 19:59:15 +0000 (15:59 -0400)]
sunrpc: keep track of the xprt_class in rpc_xprt structure

We need to keep track of the type for a given transport.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agosunrpc: add IDs to multipath
Olga Kornievskaia [Tue, 8 Jun 2021 19:59:14 +0000 (15:59 -0400)]
sunrpc: add IDs to multipath

This is used to uniquely identify sunrpc multipath objects in /sys.

Signed-off-by: Dan Aloni <dan@kernelim.com>
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agosunrpc: add xprt id
Olga Kornievskaia [Tue, 8 Jun 2021 19:59:13 +0000 (15:59 -0400)]
sunrpc: add xprt id

This adds a unique identifier for a sunrpc transport in sysfs, which is
similarly managed to the unique IDs of clients.

Signed-off-by: Dan Aloni <dan@kernelim.com>
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agosunrpc: Create per-rpc_clnt sysfs kobjects
Olga Kornievskaia [Tue, 8 Jun 2021 19:59:12 +0000 (15:59 -0400)]
sunrpc: Create per-rpc_clnt sysfs kobjects

These will eventually have files placed under them for sysfs operations.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agosunrpc: Create a client/ subdirectory in the sunrpc sysfs
Olga Kornievskaia [Tue, 8 Jun 2021 19:59:11 +0000 (15:59 -0400)]
sunrpc: Create a client/ subdirectory in the sunrpc sysfs

For network namespace separation.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agosunrpc: Create a sunrpc directory under /sys/kernel/
Olga Kornievskaia [Tue, 8 Jun 2021 19:59:10 +0000 (15:59 -0400)]
sunrpc: Create a sunrpc directory under /sys/kernel/

This is where we'll put per-rpc_client related files

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
3 years agoftrace: Use list_move instead of list_del/list_add
Baokun Li [Tue, 8 Jun 2021 03:11:08 +0000 (11:11 +0800)]
ftrace: Use list_move instead of list_del/list_add

Using list_move() instead of list_del() + list_add().

Link: https://lkml.kernel.org/r/20210608031108.2820996-1-libaokun1@huawei.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
3 years agotracing/selftests: Add tests to test histogram sym and sym-offset modifiers
Steven Rostedt (VMware) [Wed, 7 Jul 2021 16:14:51 +0000 (12:14 -0400)]
tracing/selftests: Add tests to test histogram sym and sym-offset modifiers

Add a test to the tracing selftests that will catch if the .sym or
.sym-offset modifiers break in the future.

Link: https://lkml.kernel.org/r/20210707121451.101a1002@oasis.local.home
Acked-by: Tom Zanussi <zanussi@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
3 years agopowerpc/preempt: Don't touch the idle task's preempt_count during hotplug
Valentin Schneider [Wed, 7 Jul 2021 18:38:31 +0000 (19:38 +0100)]
powerpc/preempt: Don't touch the idle task's preempt_count during hotplug

Powerpc currently resets a CPU's idle task preempt_count to 0 before
said task starts executing the secondary startup routine (and becomes an
idle task proper).

This conflicts with commit f1a0a376ca0c ("sched/core: Initialize the
idle task with preemption disabled").

which initializes all of the idle tasks' preempt_count to
PREEMPT_DISABLED during smp_init(). Note that this was superfluous
before said commit, as back then the hotplug machinery would invoke
init_idle() via idle_thread_get(), which would have already reset the
CPU's idle task's preempt_count to PREEMPT_ENABLED.

Get rid of this preempt_count write.

Fixes: f1a0a376ca0c ("sched/core: Initialize the idle task with preemption disabled")
Reported-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210707183831.2106509-1-valentin.schneider@arm.com
3 years agos390/vdso: add minimal compat vdso
Sven Schnelle [Fri, 25 Jun 2021 12:50:08 +0000 (14:50 +0200)]
s390/vdso: add minimal compat vdso

Add a small vdso for 31 bit compat application that provides
trampolines for calls to sigreturn,rt_sigreturn,syscall_restart.
This is requird for moving these syscalls away from the signal
frame to the vdso. Note that this patch effectively disables
CONFIG_COMPAT when using clang to compile the kernel. clang
doesn't support 31 bit mode.

We want to redirect sigreturn and restart_syscall to the vdso. However,
the kernel cannot parse the ELF vdso file, so we need to generate header
files which contain the offsets of the syscall instructions in the vdso
page.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
3 years agos390/vdso: rename VDSO64_LBASE to VDSO_LBASE
Sven Schnelle [Fri, 25 Jun 2021 13:10:35 +0000 (15:10 +0200)]
s390/vdso: rename VDSO64_LBASE to VDSO_LBASE

Will be used by both vdso32 and vdso64, so change the name.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
3 years agos390/vdso64: add sigreturn,rt_sigreturn and restart_syscall
Sven Schnelle [Fri, 25 Jun 2021 12:48:28 +0000 (14:48 +0200)]
s390/vdso64: add sigreturn,rt_sigreturn and restart_syscall

Add minimalistic trampolines to vdso64 so we can return from signal
without using the stack which requires pgm check handler hacks when
NX is enabled.

restart_syscall will be called from vdso to work around the architectural
limitation that the syscall number might be encoded in the svc instruction,
and therefore can not be changed.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
3 years agos390/vdso: always enable vdso
Sven Schnelle [Wed, 23 Jun 2021 12:10:00 +0000 (14:10 +0200)]
s390/vdso: always enable vdso

With the upcoming move of the svc sigreturn instruction from
the signal frame to vdso we need to have vdso always enabled.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
3 years agos390/ap: get rid of register asm
Heiko Carstens [Mon, 5 Jul 2021 10:06:55 +0000 (12:06 +0200)]
s390/ap: get rid of register asm

Using register asm statements has been proven to be very error prone,
especially when using code instrumentation where gcc may add function
calls, which clobbers register contents in an unexpected way.

Therefore get rid of register asm statements in ap code. There are also
potential bugs, depending on inline decisions of the compiler.

E.g. for:

static inline struct ap_queue_status ap_tapq(ap_qid_t qid, unsigned long *info)
{
register unsigned long reg0 asm ("0") = qid;
register struct ap_queue_status reg1 asm ("1");
register unsigned long reg2 asm ("2");

asm volatile(".long 0xb2af0000"         /* PQAP(TAPQ) */
     : "=d" (reg1), "=d" (reg2)
     : "d" (reg0)
     : "cc");
if (info)
*info = reg2;
return reg1;
}

In case of KCOV the "if (info)" line could cause a generated function
call, which could clobber the contents of both reg2, and reg1.

Similar can happen in case of KASAN for the "*info = reg2" line.

Even though compilers will likely inline the function and optimize
things away, this is not guaranteed.

To get rid of this bug class, simply get rid of register asm constructs.

Note: The inline function ap_dqap() will be handled in a
separate patch because this one requires an addressing of the
odd register of a register pair (which is done with %N[xxx] in
the assembler code) and that's currently not supported by clang.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
3 years agos390/irq: remove HAVE_IRQ_EXIT_ON_IRQ_STACK
Sven Schnelle [Tue, 6 Jul 2021 18:24:58 +0000 (20:24 +0200)]
s390/irq: remove HAVE_IRQ_EXIT_ON_IRQ_STACK

This is no longer true since we switched to generic entry. The code
switches to the IRQ stack before calling do_IRQ, but switches back
to the kernel stack before calling irq_exit().

Fixes: 56e62a737028 ("s390: convert to generic entry")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
3 years agos390/traps: do not test MONITOR CALL without CONFIG_BUG
Ilya Leoshkevich [Fri, 2 Jul 2021 13:54:55 +0000 (15:54 +0200)]
s390/traps: do not test MONITOR CALL without CONFIG_BUG

tinyconfig fails to boot, because without CONFIG_BUG report_bug()
always returns BUG_TRAP_TYPE_BUG, which causes mc 0,0 in
test_monitor_call() to panic. Fix by skipping the test without
CONFIG_BUG.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
3 years agos390/ap: Rework ap_dqap to deal with messages greater than recv buffer
Harald Freudenberger [Wed, 30 Jun 2021 14:10:56 +0000 (16:10 +0200)]
s390/ap: Rework ap_dqap to deal with messages greater than recv buffer

Rework of the ap_dqap() inline function with the dqap inline assembler
invocation and the caller code in ap_queue.c to be able to handle
replies which exceed the receive buffer size.

ap_dqap() now provides two additional parameters to handle together
with the caller the case where a reply in the firmware queue entry
exceeds the given message buffer size. It depends on the caller how to
exactly handle this. The behavior implemented now by ap_sm_recv() in
ap_queue.c is to simple purge this entry from the firmware queue and
let the caller 'receive' a -EMSGSIZE for the request without
delivering any reply data - not even a truncated reply message.

However, the reworked ap_dqap() could now get invoked in a way that
the message is received in multiple parts and the caller assembles the
parts into one reply message.

Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Suggested-by: Juergen Christ <jchrist@linux.ibm.com>
Reviewed-by: Juergen Christ <jchrist@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
3 years agoext4: inline jbd2_journal_[un]register_shrinker()
Theodore Ts'o [Fri, 2 Jul 2021 22:05:03 +0000 (18:05 -0400)]
ext4: inline jbd2_journal_[un]register_shrinker()

The function jbd2_journal_unregister_shrinker() was getting called
twice when the file system was getting unmounted.  On Power and ARM
platforms this was causing kernel crash when unmounting the file
system, when a percpu_counter was destroyed twice.

Fix this by removing jbd2_journal_[un]register_shrinker() functions,
and inlining the shrinker setup and teardown into
journal_init_common() and jbd2_journal_destroy().  This means that
ext4 and ocfs2 now no longer need to know about registering and
unregistering jbd2's shrinker.

Also, while we're at it, rename the percpu counter from
j_jh_shrink_count to j_checkpoint_jh_count, since this makes it
clearer what this counter is intended to track.

Link: https://lore.kernel.org/r/20210705145025.3363130-1-tytso@mit.edu
Fixes: 4ba3fcdde7e3 ("jbd2,ext4: add a shrinker to release checkpointed buffers")
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
3 years agoext4: fix flags validity checking for EXT4_IOC_CHECKPOINT
Theodore Ts'o [Fri, 2 Jul 2021 17:21:06 +0000 (13:21 -0400)]
ext4: fix flags validity checking for EXT4_IOC_CHECKPOINT

Use the correct bitmask when checking for any not-yet-supported flags.

Link: https://lore.kernel.org/r/20210702173425.1276158-1-tytso@mit.edu
Fixes: 351a0a3fbc35 ("ext4: add ioctl EXT4_IOC_CHECKPOINT")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Leah Rumancik <leah.rumancik@gmail.com>
3 years agoext4: fix possible UAF when remounting r/o a mmp-protected file system
Theodore Ts'o [Fri, 2 Jul 2021 16:45:02 +0000 (12:45 -0400)]
ext4: fix possible UAF when remounting r/o a mmp-protected file system

After commit 618f003199c6 ("ext4: fix memory leak in
ext4_fill_super"), after the file system is remounted read-only, there
is a race where the kmmpd thread can exit, causing sbi->s_mmp_tsk to
point at freed memory, which the call to ext4_stop_mmpd() can trip
over.

Fix this by only allowing kmmpd() to exit when it is stopped via
ext4_stop_mmpd().

Link: https://lore.kernel.org/r/20210707002433.3719773-1-tytso@mit.edu
Reported-by: Ye Bin <yebin10@huawei.com>
Bug-Report-Link: <20210629143603.2166962-1-yebin10@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
3 years agovirtio-mem: prioritize unplug from ZONE_MOVABLE in Big Block Mode
David Hildenbrand [Wed, 2 Jun 2021 18:57:20 +0000 (20:57 +0200)]
virtio-mem: prioritize unplug from ZONE_MOVABLE in Big Block Mode

Let's handle unplug in Big Block Mode similar to Sub Block Mode --
prioritize memory blocks onlined to ZONE_MOVABLE.

We won't care further about big blocks with mixed zones, as it's
rather a corner case that won't matter in practice.

Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20210602185720.31821-8-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
3 years agovirtio-mem: simplify high-level unplug handling in Big Block Mode
David Hildenbrand [Wed, 2 Jun 2021 18:57:19 +0000 (20:57 +0200)]
virtio-mem: simplify high-level unplug handling in Big Block Mode

Let's simplify high-level big block selection when unplugging in
Big Block Mode.

Combine handling of offline and online blocks. We can get rid of
virtio_mem_bbm_bb_is_offline() and simply use
virtio_mem_bbm_offline_remove_and_unplug_bb(), as that already tolerates
offline parts.

We can race with concurrent onlining/offlining either way, so we don;t
have to be super correct by failing if an offline big block we'd like to
unplug just got (partially) onlined.

Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20210602185720.31821-7-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
3 years agovirtio-mem: prioritize unplug from ZONE_MOVABLE in Sub Block Mode
David Hildenbrand [Wed, 2 Jun 2021 18:57:18 +0000 (20:57 +0200)]
virtio-mem: prioritize unplug from ZONE_MOVABLE in Sub Block Mode

Until now, memory provided by a single virtio-mem device was usually
either onlined completely to ZONE_MOVABLE (online_movable) or to
ZONE_NORMAL (online_kernel); however, that will change in the future.

There are two reasons why we want to track to which zone a memory blocks
belongs to and prioritize ZONE_MOVABLE blocks:

1) Memory managed by ZONE_MOVABLE can more likely get unplugged, therefore,
   resulting in a faster memory hotunplug process. Further, we can more
   reliably unplug and remove complete memory blocks, removing metadata
   allocated for the whole memory block.

2) We want to avoid corner cases where unplugging with the current scheme
   (highest to lowest address) could result in accidential zone imbalances,
   whereby we remove too much ZONE_NORMAL memory for ZONE_MOVABLE memory
   of the same device.

Let's track the zone via memory block states and try unplug from
ZONE_MOVABLE first. Rename VIRTIO_MEM_SBM_MB_ONLINE* to
VIRTIO_MEM_SBM_MB_KERNEL* to avoid even longer state names.

In commit 27f852795a06 ("virtio-mem: don't special-case ZONE_MOVABLE"),
we removed slightly similar tracking for fully plugged memory blocks to
support unplugging from ZONE_MOVABLE at all -- as we didn't allow partially
plugged memory blocks in ZONE_MOVABLE before that. That commit already
mentioned "In the future, we might want to remember the zone again and use
the information when (un)plugging memory."

Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20210602185720.31821-6-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
3 years agovirtio-mem: simplify high-level unplug handling in Sub Block Mode
David Hildenbrand [Wed, 2 Jun 2021 18:57:17 +0000 (20:57 +0200)]
virtio-mem: simplify high-level unplug handling in Sub Block Mode

Let's simplify by introducing a new virtio_mem_sbm_unplug_any_sb(),
similar to virtio_mem_sbm_plug_any_sb(), to simplify high-level memory
block selection when unplugging in Sub Block Mode.

Rename existing virtio_mem_sbm_unplug_any_sb() to
virtio_mem_sbm_unplug_any_sb_raw().

The only change is that we now temporarily unlock the hotplug mutex around
cond_resched() when processing offline memory blocks, which doesn't
make a real difference as we already have to temporarily unlock in
virtio_mem_sbm_unplug_any_sb_offline() when removing a memory block.

Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20210602185720.31821-5-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
3 years agovirtio-mem: simplify high-level plug handling in Sub Block Mode
David Hildenbrand [Wed, 2 Jun 2021 18:57:16 +0000 (20:57 +0200)]
virtio-mem: simplify high-level plug handling in Sub Block Mode

Let's simplify high-level memory block selection when plugging in Sub
Block Mode.

No need for two separate loops when selecting memory blocks for plugging
memory. Avoid passing the "online" state by simply obtaining the state
in virtio_mem_sbm_plug_any_sb().

Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20210602185720.31821-4-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
3 years agovirtio-mem: use page_zonenum() in virtio_mem_fake_offline()
David Hildenbrand [Wed, 2 Jun 2021 18:57:15 +0000 (20:57 +0200)]
virtio-mem: use page_zonenum() in virtio_mem_fake_offline()

Let's use page_zonenum() instead of zone_idx(page_zone()).

Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20210602185720.31821-3-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>