platform/kernel/linux-starfive.git
3 years agoinclude/linux/xz.h: drop duplicated word
Randy Dunlap [Wed, 12 Aug 2020 01:34:07 +0000 (18:34 -0700)]
include/linux/xz.h: drop duplicated word

Drop the doubled word "than" in a comment.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Lasse Collin <lasse.collin@tukaani.org>
Link: http://lkml.kernel.org/r/05ebba7a-c1e4-01ae-fc7b-15c081b33f3e@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoinclude/linux/async_tx.h: drop duplicated word in a comment
Randy Dunlap [Wed, 12 Aug 2020 01:34:04 +0000 (18:34 -0700)]
include/linux/async_tx.h: drop duplicated word in a comment

Drop the doubled word "the" in a comment.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Link: http://lkml.kernel.org/r/e85802f7-8f48-8b4c-29b3-ea237a2c7ae9@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoinclude/linux/exportfs.h: drop duplicated word in a comment
Randy Dunlap [Wed, 12 Aug 2020 01:34:00 +0000 (18:34 -0700)]
include/linux/exportfs.h: drop duplicated word in a comment

Drop the doubled word "a" in a comment.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/c61b707a-8fd8-5b1b-aab0-679122881543@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoinclude/linux/compiler-clang.h: drop duplicated word in a comment
Randy Dunlap [Wed, 12 Aug 2020 01:33:57 +0000 (18:33 -0700)]
include/linux/compiler-clang.h: drop duplicated word in a comment

Drop the doubled word "the" in a comment.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Link: http://lkml.kernel.org/r/6a18c301-3505-742f-4dd7-0f38d0e537b9@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoalpha: fix annotation of io{read,write}{16,32}be()
Luc Van Oostenryck [Wed, 12 Aug 2020 01:33:54 +0000 (18:33 -0700)]
alpha: fix annotation of io{read,write}{16,32}be()

These accessors must be used to read/write a big-endian bus.  The value
returned or written is native-endian.

However, these accessors are defined using be{16,32}_to_cpu() or
cpu_to_be{16,32}() to make the endian conversion but these expect a
__be{16,32} when none is present.  Keeping them would need a force cast
that would solve nothing at all.

So, do the conversion using swab{16,32}, like done in asm-generic for
similar situations.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Link: http://lkml.kernel.org/r/20200622114232.80039-1-luc.vanoostenryck@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoexec: use force_uaccess_begin during exec and exit
Christoph Hellwig [Wed, 12 Aug 2020 01:33:50 +0000 (18:33 -0700)]
exec: use force_uaccess_begin during exec and exit

Both exec and exit want to ensure that the uaccess routines actually do
access user pointers.  Use the newly added force_uaccess_begin helper
instead of an open coded set_fs for that to prepare for kernel builds
where set_fs() does not exist.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Link: http://lkml.kernel.org/r/20200710135706.537715-7-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agouaccess: add force_uaccess_{begin,end} helpers
Christoph Hellwig [Wed, 12 Aug 2020 01:33:47 +0000 (18:33 -0700)]
uaccess: add force_uaccess_{begin,end} helpers

Add helpers to wrap the get_fs/set_fs magic for undoing any damange done
by set_fs(KERNEL_DS).  There is no real functional benefit, but this
documents the intent of these calls better, and will allow stubbing the
functions out easily for kernels builds that do not allow address space
overrides in the future.

[hch@lst.de: drop two incorrect hunks, fix a commit log typo]
Link: http://lkml.kernel.org/r/20200714105505.935079-6-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Greentime Hu <green.hu@gmail.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Link: http://lkml.kernel.org/r/20200710135706.537715-6-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agouaccess: remove segment_eq
Christoph Hellwig [Wed, 12 Aug 2020 01:33:44 +0000 (18:33 -0700)]
uaccess: remove segment_eq

segment_eq is only used to implement uaccess_kernel.  Just open code
uaccess_kernel in the arch uaccess headers and remove one layer of
indirection.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Greentime Hu <green.hu@gmail.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Link: http://lkml.kernel.org/r/20200710135706.537715-5-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoriscv: include <asm/pgtable.h> in <asm/uaccess.h>
Christoph Hellwig [Wed, 12 Aug 2020 01:33:41 +0000 (18:33 -0700)]
riscv: include <asm/pgtable.h> in <asm/uaccess.h>

To ensure TASK_SIZE is defined for USER_DS.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Link: http://lkml.kernel.org/r/20200710135706.537715-4-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agonds32: use uaccess_kernel in show_regs
Christoph Hellwig [Wed, 12 Aug 2020 01:33:38 +0000 (18:33 -0700)]
nds32: use uaccess_kernel in show_regs

Use the uaccess_kernel helper instead of duplicating it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Greentime Hu <green.hu@gmail.com>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Link: http://lkml.kernel.org/r/20200710135706.537715-3-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agosyscalls: use uaccess_kernel in addr_limit_user_check
Christoph Hellwig [Wed, 12 Aug 2020 01:33:34 +0000 (18:33 -0700)]
syscalls: use uaccess_kernel in addr_limit_user_check

Patch series "clean up address limit helpers", v2.

In preparation for eventually phasing out direct use of set_fs(), this
series removes the segment_eq() arch helper that is only used to implement
or duplicate the uaccess_kernel() API, and then adds descriptive helpers
to force the kernel address limit.

This patch (of 6):

Use the uaccess_kernel helper instead of duplicating it.

[hch@lst.de: arm: don't call addr_limit_user_check for nommu]
Link: http://lkml.kernel.org/r/20200721045834.GA9613@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Link: http://lkml.kernel.org/r/20200714105505.935079-1-hch@lst.de
Link: http://lkml.kernel.org/r/20200710135706.537715-1-hch@lst.de
Link: http://lkml.kernel.org/r/20200710135706.537715-2-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/zsmalloc.c: fix duplicated words
Randy Dunlap [Wed, 12 Aug 2020 01:33:31 +0000 (18:33 -0700)]
mm/zsmalloc.c: fix duplicated words

Change "as as" to "as a".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200801173822.14973-16-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/zpool.c: delete duplicated word and fix grammar
Randy Dunlap [Wed, 12 Aug 2020 01:33:28 +0000 (18:33 -0700)]
mm/zpool.c: delete duplicated word and fix grammar

Drop the repeated word "if".
Fix subject/verb agreement.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200801173822.14973-15-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/vmscan.c: delete or fix duplicated words
Randy Dunlap [Wed, 12 Aug 2020 01:33:26 +0000 (18:33 -0700)]
mm/vmscan.c: delete or fix duplicated words

Drop the repeated word "marked".
Change "time time" to "same time".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200801173822.14973-14-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/usercopy.c: delete duplicated word
Randy Dunlap [Wed, 12 Aug 2020 01:33:23 +0000 (18:33 -0700)]
mm/usercopy.c: delete duplicated word

Drop the repeated word "the".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200801173822.14973-13-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/slab_common.c: delete duplicated word
Randy Dunlap [Wed, 12 Aug 2020 01:33:19 +0000 (18:33 -0700)]
mm/slab_common.c: delete duplicated word

Drop the repeated word "and".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200801173822.14973-12-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/shmem.c: delete duplicated word
Randy Dunlap [Wed, 12 Aug 2020 01:33:17 +0000 (18:33 -0700)]
mm/shmem.c: delete duplicated word

Drop the repeated word "the".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200801173822.14973-11-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/page_alloc.c: delete or fix duplicated words
Randy Dunlap [Wed, 12 Aug 2020 01:33:14 +0000 (18:33 -0700)]
mm/page_alloc.c: delete or fix duplicated words

Drop the repeated word "them" and "that".
Change "the the" to "to the".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200801173822.14973-10-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/nommu.c: delete duplicated words
Randy Dunlap [Wed, 12 Aug 2020 01:33:11 +0000 (18:33 -0700)]
mm/nommu.c: delete duplicated words

Drop the repeated word "that" in two places.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200801173822.14973-9-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/migrate.c: delete duplicated word
Randy Dunlap [Wed, 12 Aug 2020 01:33:08 +0000 (18:33 -0700)]
mm/migrate.c: delete duplicated word

Drop the repeated word "and".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200801173822.14973-8-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/memory.c: delete duplicated words
Randy Dunlap [Wed, 12 Aug 2020 01:33:05 +0000 (18:33 -0700)]
mm/memory.c: delete duplicated words

Drop the repeated word "to" in two places.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200801173822.14973-7-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/memcontrol.c: delete duplicated words
Randy Dunlap [Wed, 12 Aug 2020 01:33:02 +0000 (18:33 -0700)]
mm/memcontrol.c: delete duplicated words

Drop the repeated word "down".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200801173822.14973-6-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/hugetlb.c: delete duplicated words
Randy Dunlap [Wed, 12 Aug 2020 01:32:59 +0000 (18:32 -0700)]
mm/hugetlb.c: delete duplicated words

Drop the repeated word "the" in two places.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200801173822.14973-5-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/hmm.c: delete duplicated word
Randy Dunlap [Wed, 12 Aug 2020 01:32:56 +0000 (18:32 -0700)]
mm/hmm.c: delete duplicated word

Drop the repeated word "pages".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200801173822.14973-4-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/filemap.c: delete duplicated word
Randy Dunlap [Wed, 12 Aug 2020 01:32:53 +0000 (18:32 -0700)]
mm/filemap.c: delete duplicated word

Drop the repeated word "the".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200801173822.14973-3-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/compaction.c: delete duplicated word
Randy Dunlap [Wed, 12 Aug 2020 01:32:49 +0000 (18:32 -0700)]
mm/compaction.c: delete duplicated word

Drop the repeated word "a".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200801173822.14973-2-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agosparc: drop unused MAX_PHYSADDR_BITS
Arvind Sankar [Wed, 12 Aug 2020 01:32:46 +0000 (18:32 -0700)]
sparc: drop unused MAX_PHYSADDR_BITS

The macro is not used anywhere, so remove the definition.

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Link: http://lkml.kernel.org/r/20200723231544.17274-4-nivedita@alum.mit.edu
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agosh/mm: drop unused MAX_PHYSADDR_BITS
Arvind Sankar [Wed, 12 Aug 2020 01:32:43 +0000 (18:32 -0700)]
sh/mm: drop unused MAX_PHYSADDR_BITS

The macro is not used anywhere, so remove the definition.

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Link: http://lkml.kernel.org/r/20200723231544.17274-3-nivedita@alum.mit.edu
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoinclude/linux/memcontrol.h: drop duplicate word and fix spello
Randy Dunlap [Wed, 12 Aug 2020 01:32:40 +0000 (18:32 -0700)]
include/linux/memcontrol.h: drop duplicate word and fix spello

Drop the doubled word "for" in a comment.
Fix spello of "incremented".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/b04aa2e4-7c95-12f0-599d-43d07fb28134@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoinclude/linux/frontswap.h: drop duplicated word in a comment
Randy Dunlap [Wed, 12 Aug 2020 01:32:36 +0000 (18:32 -0700)]
include/linux/frontswap.h: drop duplicated word in a comment

Drop the doubled word "in" in a comment.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/3af7ed91-ad62-8445-40a4-9e07a64b9523@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoinclude/linux/highmem.h: fix duplicated words in a comment
Randy Dunlap [Wed, 12 Aug 2020 01:32:33 +0000 (18:32 -0700)]
include/linux/highmem.h: fix duplicated words in a comment

Change the doubled word "is" in a comment to "it is".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/ad605959-0083-4794-8d31-6b073300dd6f@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm: drop duplicated words in <linux/mm.h>
Randy Dunlap [Wed, 12 Aug 2020 01:32:30 +0000 (18:32 -0700)]
mm: drop duplicated words in <linux/mm.h>

Drop the doubled words "to" and "the".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: SeongJae Park <sjpark@amazon.de>
Link: http://lkml.kernel.org/r/d9fae8d6-0d60-4d52-9385-3199ee98de49@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm: drop duplicated words in <linux/pgtable.h>
Randy Dunlap [Wed, 12 Aug 2020 01:32:27 +0000 (18:32 -0700)]
mm: drop duplicated words in <linux/pgtable.h>

Drop the doubled words "used" and "by".

Drop the repeated acronym "TLB" and make several other fixes around it.
(capital letters, spellos)

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: SeongJae Park <sjpark@amazon.de>
Link: http://lkml.kernel.org/r/2bb6e13e-44df-4920-52d9-4d3539945f73@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm, memory_hotplug: update pcp lists everytime onlining a memory block
Charan Teja Reddy [Wed, 12 Aug 2020 01:32:24 +0000 (18:32 -0700)]
mm, memory_hotplug: update pcp lists everytime onlining a memory block

When onlining a first memory block in a zone, pcp lists are not updated
thus pcp struct will have the default setting of ->high = 0,->batch = 1.

This means till the second memory block in a zone(if it have) is onlined
the pcp lists of this zone will not contain any pages because pcp's
->count is always greater than ->high thus free_pcppages_bulk() is called
to free batch size(=1) pages every time system wants to add a page to the
pcp list through free_unref_page().

To put this in a word, system is not using benefits offered by the pcp
lists when there is a single onlineable memory block in a zone.  Correct
this by always updating the pcp lists when memory block is onlined.

Fixes: 1f522509c77a ("mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset")
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Link: http://lkml.kernel.org/r/1596372896-15336-1-git-send-email-charante@codeaurora.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/memory_hotplug: fix unpaired mem_hotplug_begin/done
Jia He [Wed, 12 Aug 2020 01:32:20 +0000 (18:32 -0700)]
mm/memory_hotplug: fix unpaired mem_hotplug_begin/done

When check_memblock_offlined_cb() returns failed rc(e.g. the memblock is
online at that time), mem_hotplug_begin/done is unpaired in such case.

Therefore a warning:
 Call Trace:
  percpu_up_write+0x33/0x40
  try_remove_memory+0x66/0x120
  ? _cond_resched+0x19/0x30
  remove_memory+0x2b/0x40
  dev_dax_kmem_remove+0x36/0x72 [kmem]
  device_release_driver_internal+0xf0/0x1c0
  device_release_driver+0x12/0x20
  bus_remove_device+0xe1/0x150
  device_del+0x17b/0x3e0
  unregister_dev_dax+0x29/0x60
  devm_action_release+0x15/0x20
  release_nodes+0x19a/0x1e0
  devres_release_all+0x3f/0x50
  device_release_driver_internal+0x100/0x1c0
  driver_detach+0x4c/0x8f
  bus_remove_driver+0x5c/0xd0
  driver_unregister+0x31/0x50
  dax_pmem_exit+0x10/0xfe0 [dax_pmem]

Fixes: f1037ec0cc8a ("mm/memory_hotplug: fix remove_memory() lockdep splat")
Signed-off-by: Jia He <justin.he@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org> [5.6+]
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chuhong Yuan <hslester96@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Cc: Kaly Xin <Kaly.Xin@arm.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200710031619.18762-3-justin.he@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/memory_hotplug: introduce default dummy memory_add_physaddr_to_nid()
Jia He [Wed, 12 Aug 2020 01:32:16 +0000 (18:32 -0700)]
mm/memory_hotplug: introduce default dummy memory_add_physaddr_to_nid()

This is to introduce a general dummy helper.  memory_add_physaddr_to_nid()
is a fallback option to get the nid in case NUMA_NO_NID is detected.

After this patch, arm64/sh/s390 can simply use the general dummy version.
PowerPC/x86/ia64 will still use their specific version.

This is the preparation to set a fallback value for dev_dax->target_node.

Signed-off-by: Jia He <justin.he@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Chuhong Yuan <hslester96@gmail.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Cc: Kaly Xin <Kaly.Xin@arm.com>
Link: http://lkml.kernel.org/r/20200710031619.18762-2-justin.he@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agox86/mm: use max memory block size on bare metal
Daniel Jordan [Wed, 12 Aug 2020 01:32:12 +0000 (18:32 -0700)]
x86/mm: use max memory block size on bare metal

Some of our servers spend significant time at kernel boot initializing
memory block sysfs directories and then creating symlinks between them and
the corresponding nodes.  The slowness happens because the machines get
stuck with the smallest supported memory block size on x86 (128M), which
results in 16,288 directories to cover the 2T of installed RAM.  The
search for each memory block is noticeable even with commit 4fb6eabf1037
("drivers/base/memory.c: cache memory blocks in xarray to accelerate
lookup").

Commit 078eb6aa50dc ("x86/mm/memory_hotplug: determine block size based on
the end of boot memory") chooses the block size based on alignment with
memory end.  That addresses hotplug failures in qemu guests, but for bare
metal systems whose memory end isn't aligned to even the smallest size, it
leaves them at 128M.

Make kernels that aren't running on a hypervisor use the largest supported
size (2G) to minimize overhead on big machines.  Kernel boot goes 7%
faster on the aforementioned servers, shaving off half a second.

[daniel.m.jordan@oracle.com: v3]
Link: http://lkml.kernel.org/r/20200714205450.945834-1-daniel.m.jordan@oracle.com
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200609225451.3542648-1-daniel.m.jordan@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm: mmu_notifier: fix and extend kerneldoc
Krzysztof Kozlowski [Wed, 12 Aug 2020 01:32:09 +0000 (18:32 -0700)]
mm: mmu_notifier: fix and extend kerneldoc

Fix W=1 compile warnings (invalid kerneldoc):

    mm/mmu_notifier.c:187: warning: Function parameter or member 'interval_sub' not described in 'mmu_interval_read_bgin'
    mm/mmu_notifier.c:708: warning: Function parameter or member 'subscription' not described in 'mmu_notifier_registr'
    mm/mmu_notifier.c:708: warning: Excess function parameter 'mn' description in 'mmu_notifier_register'
    mm/mmu_notifier.c:880: warning: Function parameter or member 'subscription' not described in 'mmu_notifier_put'
    mm/mmu_notifier.c:880: warning: Excess function parameter 'mn' description in 'mmu_notifier_put'
    mm/mmu_notifier.c:982: warning: Function parameter or member 'ops' not described in 'mmu_interval_notifier_insert'

Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: http://lkml.kernel.org/r/20200728171109.28687-4-krzk@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoinclude/linux/sched/mm.h: optimize current_gfp_context()
Waiman Long [Wed, 12 Aug 2020 01:32:06 +0000 (18:32 -0700)]
include/linux/sched/mm.h: optimize current_gfp_context()

The current_gfp_context() converts a number of PF_MEMALLOC_* per-process
flags into the corresponding GFP_* flags for memory allocation.  In that
function, current->flags is accessed 3 times.  That may lead to duplicated
access of the same memory location.

This is not usually a problem with minimal debug config options on as the
compiler can optimize away the duplicated memory accesses.  With most of
the debug config options on, however, that may not be the case.  For
example, the x86-64 object size of the __need_fs_reclaim() in a debug
kernel that calls current_gfp_context() was 309 bytes.  With this patch
applied, the object size is reduced to 202 bytes.  This is a saving of 107
bytes and will probably be slightly faster too.

Use READ_ONCE() to access current->flags to prevent the compiler from
possibly accessing current->flags multiple times.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michel Lespinasse <walken@google.com>
Link: http://lkml.kernel.org/r/20200618212936.9776-1-longman@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agocma: don't quit at first error when activating reserved areas
Mike Kravetz [Wed, 12 Aug 2020 01:32:03 +0000 (18:32 -0700)]
cma: don't quit at first error when activating reserved areas

The routine cma_init_reserved_areas is designed to activate all
reserved cma areas.  It quits when it first encounters an error.
This can leave some areas in a state where they are reserved but
not activated.  There is no feedback to code which performed the
reservation.  Attempting to allocate memory from areas in such a
state will result in a BUG.

Modify cma_init_reserved_areas to always attempt to activate all
areas.  The called routine, cma_activate_area is responsible for
leaving the area in a valid state.  No one is making active use
of returned error codes, so change the routine to void.

How to reproduce:  This example uses kernelcore, hugetlb and cma
as an easy way to reproduce.  However, this is a more general cma
issue.

Two node x86 VM 16GB total, 8GB per node
Kernel command line parameters, kernelcore=4G hugetlb_cma=8G
Related boot time messages,
  hugetlb_cma: reserve 8192 MiB, up to 4096 MiB per node
  cma: Reserved 4096 MiB at 0x0000000100000000
  hugetlb_cma: reserved 4096 MiB on node 0
  cma: Reserved 4096 MiB at 0x0000000300000000
  hugetlb_cma: reserved 4096 MiB on node 1
  cma: CMA area hugetlb could not be activated

 # echo 8 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

  BUG: kernel NULL pointer dereference, address: 0000000000000000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP PTI
  ...
  Call Trace:
    bitmap_find_next_zero_area_off+0x51/0x90
    cma_alloc+0x1a5/0x310
    alloc_fresh_huge_page+0x78/0x1a0
    alloc_pool_huge_page+0x6f/0xf0
    set_max_huge_pages+0x10c/0x250
    nr_hugepages_store_common+0x92/0x120
    ? __kmalloc+0x171/0x270
    kernfs_fop_write+0xc1/0x1a0
    vfs_write+0xc7/0x1f0
    ksys_write+0x5f/0xe0
    do_syscall_64+0x4d/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: c64be2bb1c6e ("drivers: add Contiguous Memory Allocator")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Barry Song <song.bao.hua@hisilicon.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200730163123.6451-1-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm: hugetlb: fix the name of hugetlb CMA
Barry Song [Wed, 12 Aug 2020 01:32:00 +0000 (18:32 -0700)]
mm: hugetlb: fix the name of hugetlb CMA

Once we enable CMA_DEBUGFS, we will get the below errors: directory
'cma-hugetlb' with parent 'cma' already present.

We should have different names for different CMA areas.

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/20200616223131.33828-3-song.bao.hua@hisilicon.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm: cma: fix the name of CMA areas
Barry Song [Wed, 12 Aug 2020 01:31:57 +0000 (18:31 -0700)]
mm: cma: fix the name of CMA areas

Patch series "mm: fix the names of general cma and hugetlb cma", v2.

The current code of CMA can only work when users pass a const string as
name parameter.  we need to fix the way to handle names in CMA.  On the
other hand, to avoid name conflicts after enabling CMA_DEBUGFS, each
hugetlb should get a different CMA name.

This patch (of 2):

If users give a name saved in stack, the current code will generate magic
pointer.  if users don't give a name(NULL), kasprintf() will always return
NULL as we are at the early stage.  that means cma_init_reserved_mem()
will return -ENOMEM if users set name parameter as NULL.

[natechancellor@gmail.com: return cma->name directly in cma_get_name]
Link: https://github.com/ClangBuiltLinux/linux/issues/1063
Link: http://lkml.kernel.org/r/20200623015840.621964-1-natechancellor@gmail.com
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/20200616223131.33828-2-song.bao.hua@hisilicon.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/cma.c: fix NULL pointer dereference when cma could not be activated
Jianqun Xu [Wed, 12 Aug 2020 01:31:54 +0000 (18:31 -0700)]
mm/cma.c: fix NULL pointer dereference when cma could not be activated

In some case the cma area could not be activated, but the cma_alloc be
used under this case, then the kernel will crash caused by NULL pointer
dereference.

Add bitmap valid check in cma_alloc to avoid this issue.

Signed-off-by: Jianqun Xu <jay.xu@rock-chips.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: http://lkml.kernel.org/r/20200615010123.15596-1-jay.xu@rock-chips.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/vmstat: add events for THP migration without split
Anshuman Khandual [Wed, 12 Aug 2020 01:31:51 +0000 (18:31 -0700)]
mm/vmstat: add events for THP migration without split

Add following new vmstat events which will help in validating THP
migration without split.  Statistics reported through these new VM events
will help in performance debugging.

1. THP_MIGRATION_SUCCESS
2. THP_MIGRATION_FAILURE
3. THP_MIGRATION_SPLIT

In addition, these new events also update normal page migration statistics
appropriately via PGMIGRATE_SUCCESS and PGMIGRATE_FAILURE.  While here,
this updates current trace event 'mm_migrate_pages' to accommodate now
available THP statistics.

[akpm@linux-foundation.org: s/hpage_nr_pages/thp_nr_pages/]
[ziy@nvidia.com: v2]
Link: http://lkml.kernel.org/r/C5E3C65C-8253-4638-9D3C-71A61858BB8B@nvidia.com
[anshuman.khandual@arm.com: s/thp_nr_pages/hpage_nr_pages/]
Link: http://lkml.kernel.org/r/1594287583-16568-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Link: http://lkml.kernel.org/r/1594080415-27924-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm: thp: remove debug_cow switch
Yang Shi [Wed, 12 Aug 2020 01:31:48 +0000 (18:31 -0700)]
mm: thp: remove debug_cow switch

Since commit 3917c80280c93a7123f ("thp: change CoW semantics for
anon-THP"), the CoW page fault of THP has been rewritten, debug_cow is not
used anymore.  So, just remove it.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/r/1592270980-116062-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/migrate: add migrate-shared test for migrate_vma_*()
Ralph Campbell [Wed, 12 Aug 2020 01:31:45 +0000 (18:31 -0700)]
mm/migrate: add migrate-shared test for migrate_vma_*()

Add a migrate_vma_*() self test for mmap(MAP_SHARED) to verify that
!vma_anonymous() ranges won't be migrated.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: "Bharata B Rao" <bharata@linux.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Link: http://lkml.kernel.org/r/20200710194840.7602-3-rcampbell@nvidia.com
Link: http://lkml.kernel.org/r/20200709165711.26584-3-rcampbell@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/migrate: optimize migrate_vma_setup() for holes
Ralph Campbell [Wed, 12 Aug 2020 01:31:41 +0000 (18:31 -0700)]
mm/migrate: optimize migrate_vma_setup() for holes

Patch series "mm/migrate: optimize migrate_vma_setup() for holes".

A simple optimization for migrate_vma_*() when the source vma is not an
anonymous vma and a new test case to exercise it.

This patch (of 2):

When migrating system memory to device private memory, if the source
address range is a valid VMA range and there is no memory or a zero page,
the source PFN array is marked as valid but with no PFN.

This lets the device driver allocate private memory and clear it, then
insert the new device private struct page into the CPU's page tables when
migrate_vma_pages() is called.  migrate_vma_pages() only inserts the new
page if the VMA is an anonymous range.

There is no point in telling the device driver to allocate device private
memory and then not migrate the page.  Instead, mark the source PFN array
entries as not migrating to avoid this overhead.

[rcampbell@nvidia.com: v2]
Link: http://lkml.kernel.org/r/20200710194840.7602-2-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: "Bharata B Rao" <bharata@linux.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Link: http://lkml.kernel.org/r/20200710194840.7602-1-rcampbell@nvidia.com
Link: http://lkml.kernel.org/r/20200709165711.26584-1-rcampbell@nvidia.com
Link: http://lkml.kernel.org/r/20200709165711.26584-2-rcampbell@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agohugetlbfs: remove call to huge_pte_alloc without i_mmap_rwsem
Mike Kravetz [Wed, 12 Aug 2020 01:31:38 +0000 (18:31 -0700)]
hugetlbfs: remove call to huge_pte_alloc without i_mmap_rwsem

Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") requires callers of huge_pte_alloc to hold i_mmap_rwsem
in at least read mode.  This is because the explicit locking in
huge_pmd_share (called by huge_pte_alloc) was removed.  When restructuring
the code, the call to huge_pte_alloc in the else block at the beginning of
hugetlb_fault was missed.

Unfortunately, that else clause is exercised when there is no page table
entry.  This will likely lead to a call to huge_pmd_share.  If
huge_pmd_share thinks pmd sharing is possible, it will traverse the
mapping tree (i_mmap) without holding i_mmap_rwsem.  If someone else is
modifying the tree, bad things such as addressing exceptions or worse
could happen.

Simply remove the else clause.  It should have been removed previously.
The code following the else will call huge_pte_alloc with the appropriate
locking.

To prevent this type of issue in the future, add routines to assert that
i_mmap_rwsem is held, and call these routines in huge pmd sharing
routines.

Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A.Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/e670f327-5cf9-1959-96e4-6dc7cc30d3d5@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agohugetlbfs: prevent filesystem stacking of hugetlbfs
Mike Kravetz [Wed, 12 Aug 2020 01:31:35 +0000 (18:31 -0700)]
hugetlbfs: prevent filesystem stacking of hugetlbfs

syzbot found issues with having hugetlbfs on a union/overlay as reported
in [1].  Due to the limitations (no write) and special functionality of
hugetlbfs, it does not work well in filesystem stacking.  There are no
know use cases for hugetlbfs stacking.  Rather than making modifications
to get hugetlbfs working in such environments, simply prevent stacking.

[1] https://lore.kernel.org/linux-mm/000000000000b4684e05a2968ca6@google.com/

Reported-by: syzbot+d6ec23007e951dadf3de@syzkaller.appspotmail.com
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Colin Walters <walters@verbum.org>
Link: http://lkml.kernel.org/r/80f869aa-810d-ef6c-8888-b46cee135907@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm, oom: show process exiting information in __oom_kill_process()
Yafang Shao [Wed, 12 Aug 2020 01:31:32 +0000 (18:31 -0700)]
mm, oom: show process exiting information in __oom_kill_process()

When the OOM killer finds a victim and tryies to kill it, if the victim is
already exiting, the task mm will be NULL and no process will be killed.
But the dump_header() has been already executed, so it will be strange to
dump so much information without killing a process.  We'd better show some
helpful information to indicate why this happens.

Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/20200721010127.17238-1-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agodoc, mm: clarify /proc/<pid>/oom_score value range
Michal Hocko [Wed, 12 Aug 2020 01:31:28 +0000 (18:31 -0700)]
doc, mm: clarify /proc/<pid>/oom_score value range

The exported value includes oom_score_adj so the range is no [0, 1000] as
described in the previous section but rather [0, 2000].  Mention that fact
explicitly.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Link: http://lkml.kernel.org/r/20200709062603.18480-2-mhocko@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agodoc, mm: sync up oom_score_adj documentation
Michal Hocko [Wed, 12 Aug 2020 01:31:25 +0000 (18:31 -0700)]
doc, mm: sync up oom_score_adj documentation

There are at least two notes in the oom section.  The 3% discount for root
processes is gone since d46078b28889 ("mm, oom: remove 3% bonus for
CAP_SYS_ADMIN processes").

Likewise children of the selected oom victim are not sacrificed since
bbbe48029720 ("mm, oom: remove 'prefer children over parent' heuristic")

Drop both of them.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Link: http://lkml.kernel.org/r/20200709062603.18480-1-mhocko@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm, oom: make the calculation of oom badness more accurate
Yafang Shao [Wed, 12 Aug 2020 01:31:22 +0000 (18:31 -0700)]
mm, oom: make the calculation of oom badness more accurate

Recently we found an issue on our production environment that when memcg
oom is triggered the oom killer doesn't chose the process with largest
resident memory but chose the first scanned process.  Note that all
processes in this memcg have the same oom_score_adj, so the oom killer
should chose the process with largest resident memory.

Bellow is part of the oom info, which is enough to analyze this issue.
[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
[7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
[7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
[7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
[7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
[7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
[7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
[7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
[7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
[7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
[7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
[7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
[7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
[7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
[7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
[7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
[7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
[7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
[7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
[7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
[7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

We can find that the first scanned process 5740 (pause) was killed, but
its rss is only one page.  That is because, when we calculate the oom
badness in oom_badness(), we always ignore the negtive point and convert
all of these negtive points to 1.  Now as oom_score_adj of all the
processes in this targeted memcg have the same value -998, the points of
these processes are all negtive value.  As a result, the first scanned
process will be killed.

The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
a Guaranteed pod, which has higher priority to prevent from being killed
by system oom.

To fix this issue, we should make the calculation of oom point more
accurate.  We can achieve it by convert the chosen_point from 'unsigned
long' to 'long'.

[cai@lca.pw: reported a issue in the previous version]
[mhocko@suse.com: fixed the issue reported by Cai]
[mhocko@suse.com: add the comment in proc_oom_score()]
[laoar.shao@gmail.com: v3]
Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoinclude/linux/mempolicy.h: fix typo
Yanfei Xu [Wed, 12 Aug 2020 01:31:19 +0000 (18:31 -0700)]
include/linux/mempolicy.h: fix typo

Change "interlave" to "interleave".

Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200810063454.9357-1-yanfei.xu@windriver.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/mempolicy.c: check parameters first in kernel_get_mempolicy
Wenchao Hao [Wed, 12 Aug 2020 01:31:16 +0000 (18:31 -0700)]
mm/mempolicy.c: check parameters first in kernel_get_mempolicy

Previous implementatoin calls untagged_addr() before error check, while if
the error check failed and return EINVAL, the untagged_addr() call is just
useless work.

Signed-off-by: Wenchao Hao <haowenchao22@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200801090825.5597-1-haowenchao22@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm: mempolicy: fix kerneldoc of numa_map_to_online_node()
Krzysztof Kozlowski [Wed, 12 Aug 2020 01:31:13 +0000 (18:31 -0700)]
mm: mempolicy: fix kerneldoc of numa_map_to_online_node()

Fix W=1 compile warnings (invalid kerneldoc):

    mm/mempolicy.c:137: warning: Function parameter or member 'node' not described in 'numa_map_to_online_node'
    mm/mempolicy.c:137: warning: Excess function parameter 'nid' description in 'numa_map_to_online_node'

Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200728171109.28687-3-krzk@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/compaction: correct the comments of compact_defer_shift
Alex Shi [Wed, 12 Aug 2020 01:31:10 +0000 (18:31 -0700)]
mm/compaction: correct the comments of compact_defer_shift

There is no compact_defer_limit. It should be compact_defer_shift in
use. and add compact_order_failed explanation.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Link: http://lkml.kernel.org/r/3bd60e1b-a74e-050d-ade4-6e8f54e00b92@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm: use unsigned types for fragmentation score
Nitin Gupta [Wed, 12 Aug 2020 01:31:07 +0000 (18:31 -0700)]
mm: use unsigned types for fragmentation score

Proactive compaction uses per-node/zone "fragmentation score" which is
always in range [0, 100], so use unsigned type of these scores as well as
for related constants.

Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: http://lkml.kernel.org/r/20200618010319.13159-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm: fix compile error due to COMPACTION_HPAGE_ORDER
Nitin Gupta [Wed, 12 Aug 2020 01:31:04 +0000 (18:31 -0700)]
mm: fix compile error due to COMPACTION_HPAGE_ORDER

Fix compile error when COMPACTION_HPAGE_ORDER is assigned to
HUGETLB_PAGE_ORDER.  The correct way to check if this constant is defined
is to check for CONFIG_HUGETLBFS.

Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623064544.25766-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm: proactive compaction
Nitin Gupta [Wed, 12 Aug 2020 01:31:00 +0000 (18:31 -0700)]
mm: proactive compaction

For some applications, we need to allocate almost all memory as hugepages.
However, on a running system, higher-order allocations can fail if the
memory is fragmented.  Linux kernel currently does on-demand compaction as
we request more hugepages, but this style of compaction incurs very high
latency.  Experiments with one-time full memory compaction (followed by
hugepage allocations) show that kernel is able to restore a highly
fragmented memory state to a fairly compacted memory state within <1 sec
for a 32G system.  Such data suggests that a more proactive compaction can
help us allocate a large fraction of memory as hugepages keeping
allocation latencies low.

For a more proactive compaction, the approach taken here is to define a
new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
external fragmentation which kcompactd tries to maintain.

The tunable takes a value in range [0, 100], with a default of 20.

Note that a previous version of this patch [1] was found to introduce too
many tunables (per-order extfrag{low, high}), but this one reduces them to
just one sysctl.  Also, the new tunable is an opaque value instead of
asking for specific bounds of "external fragmentation", which would have
been difficult to estimate.  The internal interpretation of this opaque
value allows for future fine-tuning.

Currently, we use a simple translation from this tunable to [low, high]
"fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
The score for a node is defined as weighted mean of per-zone external
fragmentation.  A zone's present_pages determines its weight.

To periodically check per-node score, we reuse per-node kcompactd threads,
which are woken up every 500 milliseconds to check the same.  If a node's
score exceeds its high threshold (as derived from user-provided
proactiveness value), proactive compaction is started until its score
reaches its low threshold value.  By default, proactiveness is set to 20,
which implies threshold values of low=80 and high=90.

This patch is largely based on ideas from Michal Hocko [2].  See also the
LWN article [3].

Performance data
================

System: x64_64, 1T RAM, 80 CPU threads.
Kernel: 5.6.0-rc3 + this patch

echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

Before starting the driver, the system was fragmented from a userspace
program that allocates all memory and then for each 2M aligned section,
frees 3/4 of base pages using munmap.  The workload is mainly anonymous
userspace pages, which are easy to move around.  I intentionally avoided
unmovable pages in this test to see how much latency we incur when
hugepage allocations hit direct compaction.

1. Kernel hugepage allocation latencies

With the system in such a fragmented state, a kernel driver then allocates
as many hugepages as possible and measures allocation latency:

(all latency values are in microseconds)

- With vanilla 5.6.0-rc3

  percentile latency
  –––––––––– –––––––
   5    7894
  10    9496
  25   12561
  30   15295
  40   18244
  50   21229
  60   27556
  75   30147
  80   31047
  90   32859
  95   33799

Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)

- With 5.6.0-rc3 + this patch, with proactiveness=20

sysctl -w vm.compaction_proactiveness=20

  percentile latency
  –––––––––– –––––––
   5       2
  10       2
  25       3
  30       3
  40       3
  50       4
  60       4
  75       4
  80       4
  90       5
  95     429

Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)

2. JAVA heap allocation

In this test, we first fragment memory using the same method as for (1).

Then, we start a Java process with a heap size set to 700G and request the
heap to be allocated with THP hugepages.  We also set THP to madvise to
allow hugepage backing of this heap.

/usr/bin/time
 java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch

The above command allocates 700G of Java heap using hugepages.

- With vanilla 5.6.0-rc3

17.39user 1666.48system 27:37.89elapsed

- With 5.6.0-rc3 + this patch, with proactiveness=20

8.35user 194.58system 3:19.62elapsed

Elapsed time remains around 3:15, as proactiveness is further increased.

Note that proactive compaction happens throughout the runtime of these
workloads.  The situation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.

In the above Java workload, proactiveness is set to 20.  The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction.  As t he benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the low
threshold level (80).  Repeat.

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark.  kcompactd threads consume 100% of one of
the CPUs while it tries to bring a node's score within thresholds.

Backoff behavior
================

Above workloads produce a memory state which is easy to compact.  However,
if memory is filled with unmovable pages, proactive compaction should
essentially back off.  To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
  followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred maximum number of times
  with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
  (=> ~30 seconds between retries).

[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/

Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nitin Gupta <ngupta@nitingupta.dev>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years ago/proc/PID/smaps: consistent whitespace output format
Michal Koutný [Wed, 12 Aug 2020 01:30:57 +0000 (18:30 -0700)]
/proc/PID/smaps: consistent whitespace output format

The keys in smaps output are padded to fixed width with spaces.  All
except for THPeligible that uses tabs (only since commit c06306696f83
("mm: thp: fix false negative of shmem vma's THP eligibility")).

Unify the output formatting to save time debugging some naïve parsers.
(Part of the unification is also aligning FilePmdMapped with others.)

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Link: http://lkml.kernel.org/r/20200728083207.17531-1-mkoutny@suse.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/vmscan: restore active/inactive ratio for anonymous LRU
Joonsoo Kim [Wed, 12 Aug 2020 01:30:54 +0000 (18:30 -0700)]
mm/vmscan: restore active/inactive ratio for anonymous LRU

Now that workingset detection is implemented for anonymous LRU, we don't
need large inactive list to allow detecting frequently accessed pages
before they are reclaimed, anymore.  This effectively reverts the
temporary measure put in by commit "mm/vmscan: make active/inactive ratio
as 1:1 for anon lru".

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Link: http://lkml.kernel.org/r/1595490560-15117-7-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/swap: implement workingset detection for anonymous LRU
Joonsoo Kim [Wed, 12 Aug 2020 01:30:50 +0000 (18:30 -0700)]
mm/swap: implement workingset detection for anonymous LRU

This patch implements workingset detection for anonymous LRU.  All the
infrastructure is implemented by the previous patches so this patch just
activates the workingset detection by installing/retrieving the shadow
entry and adding refault calculation.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Link: http://lkml.kernel.org/r/1595490560-15117-6-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/swapcache: support to handle the shadow entries
Joonsoo Kim [Wed, 12 Aug 2020 01:30:47 +0000 (18:30 -0700)]
mm/swapcache: support to handle the shadow entries

Workingset detection for anonymous page will be implemented in the
following patch and it requires to store the shadow entries into the
swapcache.  This patch implements an infrastructure to store the shadow
entry in the swapcache.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/1595490560-15117-5-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/workingset: prepare the workingset detection infrastructure for anon LRU
Joonsoo Kim [Wed, 12 Aug 2020 01:30:43 +0000 (18:30 -0700)]
mm/workingset: prepare the workingset detection infrastructure for anon LRU

To prepare the workingset detection for anon LRU, this patch splits
workingset event counters for refault, activate and restore into anon and
file variants, as well as the refaults counter in struct lruvec.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Link: http://lkml.kernel.org/r/1595490560-15117-4-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/vmscan: protect the workingset on anonymous LRU
Joonsoo Kim [Wed, 12 Aug 2020 01:30:40 +0000 (18:30 -0700)]
mm/vmscan: protect the workingset on anonymous LRU

In current implementation, newly created or swap-in anonymous page is
started on active list.  Growing active list results in rebalancing
active/inactive list so old pages on active list are demoted to inactive
list.  Hence, the page on active list isn't protected at all.

Following is an example of this situation.

Assume that 50 hot pages on active list.  Numbers denote the number of
pages on active/inactive list (active | inactive).

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)

3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)

This patch tries to fix this issue.  Like as file LRU, newly created or
swap-in anonymous pages will be inserted to the inactive list.  They are
promoted to active list if enough reference happens.  This simple
modification changes the above example as following.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)

3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)

As you can see, hot pages on active list would be protected.

Note that, this implementation has a drawback that the page cannot be
promoted and will be swapped-out if re-access interval is greater than the
size of inactive list but less than the size of total(active+inactive).
To solve this potential issue, following patch will apply workingset
detection similar to the one that's already applied to file LRU.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/vmscan: make active/inactive ratio as 1:1 for anon lru
Joonsoo Kim [Wed, 12 Aug 2020 01:30:36 +0000 (18:30 -0700)]
mm/vmscan: make active/inactive ratio as 1:1 for anon lru

Patch series "workingset protection/detection on the anonymous LRU list", v7.

* PROBLEM
In current implementation, newly created or swap-in anonymous page is
started on the active list.  Growing the active list results in
rebalancing active/inactive list so old pages on the active list are
demoted to the inactive list.  Hence, hot page on the active list isn't
protected at all.

Following is an example of this situation.

Assume that 50 hot pages on active list and system can contain total 100
pages.  Numbers denote the number of pages on active/inactive list (active
| inactive).  (h) stands for hot pages and (uo) stands for used-once
pages.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)

3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)

As we can see, hot pages are swapped-out and it would cause swap-in later.

* SOLUTION
Since this is what we want to avoid, this patchset implements workingset
protection.  Like as the file LRU list, newly created or swap-in anonymous
page is started on the inactive list.  Also, like as the file LRU list, if
enough reference happens, the page will be promoted.  This simple
modification changes the above example as following.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)

3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)

hot pages remains in the active list. :)

* EXPERIMENT
I tested this scenario on my test bed and confirmed that this problem
happens on current implementation. I also checked that it is fixed by
this patchset.

* SUBJECT
workingset detection

* PROBLEM
Later part of the patchset implements the workingset detection for the
anonymous LRU list.  There is a corner case that workingset protection
could cause thrashing.  If we can avoid thrashing by workingset detection,
we can get the better performance.

Following is an example of thrashing due to the workingset protection.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (will be hot) pages
50(h) | 50(wh)

3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(wh)

4. workload: 50 (will be hot) pages
50(h) | 50(wh), swap-in 50(wh)

5. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(wh)

6. repeat 4, 5

Without workingset detection, this kind of workload cannot be promoted and
thrashing happens forever.

* SOLUTION
Therefore, this patchset implements workingset detection.  All the
infrastructure for workingset detecion is already implemented, so there is
not much work to do.  First, extend workingset detection code to deal with
the anonymous LRU list.  Then, make swap cache handles the exceptional
value for the shadow entry.  Lastly, install/retrieve the shadow value
into/from the swap cache and check the refault distance.

* EXPERIMENT
I made a test program to imitates above scenario and confirmed that
problem exists.  Then, I checked that this patchset fixes it.

My test setup is a virtual machine with 8 cpus and 6100MB memory.  But,
the amount of the memory that the test program can use is about 280 MB.
This is because the system uses large ram-backed swap and large ramdisk to
capture the trace.

Test scenario is like as below.

1. allocate cold memory (512MB)
2. allocate hot-1 memory (96MB)
3. activate hot-1 memory (96MB)
4. allocate another hot-2 memory (96MB)
5. access cold memory (128MB)
6. access hot-2 memory (96MB)
7. repeat 5, 6

Since hot-1 memory (96MB) is on the active list, the inactive list can
contains roughly 190MB pages.  hot-2 memory's re-access interval (96+128
MB) is more 190MB, so it cannot be promoted without workingset detection
and swap-in/out happens repeatedly.  With this patchset, workingset
detection works and promotion happens.  Therefore, swap-in/out occurs
less.

Here is the result. (average of 5 runs)

type swap-in swap-out
base 863240 989945
patch 681565 809273

As we can see, patched kernel do less swap-in/out.

* OVERALL TEST (ebizzy using modified random function)
ebizzy is the test program that main thread allocates lots of memory and
child threads access them randomly during the given times.  Swap-in will
happen if allocated memory is larger than the system memory.

The random function that represents the zipf distribution is used to make
hot/cold memory.  Hot/cold ratio is controlled by the parameter.  If the
parameter is high, hot memory is accessed much larger than cold one.  If
the parameter is low, the number of access on each memory would be
similar.  I uses various parameters in order to show the effect of
patchset on various hot/cold ratio workload.

My test setup is a virtual machine with 8 cpus, 1024 MB memory and 5120 MB
ram swap.

Result format is as following.

param: 1-1024-0.1
- 1 (number of thread)
- 1024 (allocated memory size, MB)
- 0.1 (zipf distribution alpha,
0.1 works like as roughly uniform random,
1.3 works like as small portion of memory is hot and the others are cold)

pswpin: smaller is better
std: standard deviation
improvement: negative is better

* single thread
           param        pswpin       std       improvement
      base 1-1024.0-0.1 14101983.40   79441.19
      prot 1-1024.0-0.1 14065875.80  136413.01  (   -0.26 )
    detect 1-1024.0-0.1 13910435.60  100804.82  (   -1.36 )
      base 1-1024.0-0.7 7998368.80   43469.32
      prot 1-1024.0-0.7 7622245.80   88318.74  (   -4.70 )
    detect 1-1024.0-0.7 7618515.20   59742.07  (   -4.75 )
      base 1-1024.0-1.3 1017400.80   38756.30
      prot 1-1024.0-1.3  940464.60   29310.69  (   -7.56 )
    detect 1-1024.0-1.3  945511.40   24579.52  (   -7.07 )
      base 1-1280.0-0.1 22895541.40   50016.08
      prot 1-1280.0-0.1 22860305.40   51952.37  (   -0.15 )
    detect 1-1280.0-0.1 22705565.20   93380.35  (   -0.83 )
      base 1-1280.0-0.7 13717645.60   46250.65
      prot 1-1280.0-0.7 12935355.80   64754.43  (   -5.70 )
    detect 1-1280.0-0.7 13040232.00   63304.00  (   -4.94 )
      base 1-1280.0-1.3 1654251.40    4159.68
      prot 1-1280.0-1.3 1522680.60   33673.50  (   -7.95 )
    detect 1-1280.0-1.3 1599207.00   70327.89  (   -3.33 )
      base 1-1536.0-0.1 31621775.40   31156.28
      prot 1-1536.0-0.1 31540355.20   62241.36  (   -0.26 )
    detect 1-1536.0-0.1 31420056.00  123831.27  (   -0.64 )
      base 1-1536.0-0.7 19620760.60   60937.60
      prot 1-1536.0-0.7 18337839.60   56102.58  (   -6.54 )
    detect 1-1536.0-0.7 18599128.00   75289.48  (   -5.21 )
      base 1-1536.0-1.3 2378142.40   20994.43
      prot 1-1536.0-1.3 2166260.60   48455.46  (   -8.91 )
    detect 1-1536.0-1.3 2183762.20   16883.24  (   -8.17 )
      base 1-1792.0-0.1 40259714.80   90750.70
      prot 1-1792.0-0.1 40053917.20   64509.47  (   -0.51 )
    detect 1-1792.0-0.1 39949736.40  104989.64  (   -0.77 )
      base 1-1792.0-0.7 25704884.40   69429.68
      prot 1-1792.0-0.7 23937389.00   79945.60  (   -6.88 )
    detect 1-1792.0-0.7 24271902.00   35044.30  (   -5.57 )
      base 1-1792.0-1.3 3129497.00   32731.86
      prot 1-1792.0-1.3 2796994.40   19017.26  (  -10.62 )
    detect 1-1792.0-1.3 2886840.40   33938.82  (   -7.75 )
      base 1-2048.0-0.1 48746924.40   50863.88
      prot 1-2048.0-0.1 48631954.40   24537.30  (   -0.24 )
    detect 1-2048.0-0.1 48509419.80   27085.34  (   -0.49 )
      base 1-2048.0-0.7 32046424.40   78624.22
      prot 1-2048.0-0.7 29764182.20   86002.26  (   -7.12 )
    detect 1-2048.0-0.7 30250315.80  101282.14  (   -5.60 )
      base 1-2048.0-1.3 3916723.60   24048.55
      prot 1-2048.0-1.3 3490781.60   33292.61  (  -10.87 )
    detect 1-2048.0-1.3 3585002.20   44942.04  (   -8.47 )

* multi thread
           param        pswpin       std       improvement
      base 8-1024.0-0.1 16219822.60  329474.01
      prot 8-1024.0-0.1 15959494.00  654597.45  (   -1.61 )
    detect 8-1024.0-0.1 15773790.80  502275.25  (   -2.75 )
      base 8-1024.0-0.7 9174107.80  537619.33
      prot 8-1024.0-0.7 8571915.00  385230.08  (   -6.56 )
    detect 8-1024.0-0.7 8489484.20  364683.00  (   -7.46 )
      base 8-1024.0-1.3 1108495.60   83555.98
      prot 8-1024.0-1.3 1038906.20   63465.20  (   -6.28 )
    detect 8-1024.0-1.3  941817.80   32648.80  (  -15.04 )
      base 8-1280.0-0.1 25776114.20  450480.45
      prot 8-1280.0-0.1 25430847.00  465627.07  (   -1.34 )
    detect 8-1280.0-0.1 25282555.00  465666.55  (   -1.91 )
      base 8-1280.0-0.7 15218968.00  702007.69
      prot 8-1280.0-0.7 13957947.80  492643.86  (   -8.29 )
    detect 8-1280.0-0.7 14158331.20  238656.02  (   -6.97 )
      base 8-1280.0-1.3 1792482.80   30512.90
      prot 8-1280.0-1.3 1577686.40   34002.62  (  -11.98 )
    detect 8-1280.0-1.3 1556133.00   22944.79  (  -13.19 )
      base 8-1536.0-0.1 33923761.40  575455.85
      prot 8-1536.0-0.1 32715766.20  300633.51  (   -3.56 )
    detect 8-1536.0-0.1 33158477.40  117764.51  (   -2.26 )
      base 8-1536.0-0.7 20628907.80  303851.34
      prot 8-1536.0-0.7 19329511.20  341719.31  (   -6.30 )
    detect 8-1536.0-0.7 20013934.00  385358.66  (   -2.98 )
      base 8-1536.0-1.3 2588106.40  130769.20
      prot 8-1536.0-1.3 2275222.40   89637.06  (  -12.09 )
    detect 8-1536.0-1.3 2365008.40  124412.55  (   -8.62 )
      base 8-1792.0-0.1 43328279.20  946469.12
      prot 8-1792.0-0.1 41481980.80  525690.89  (   -4.26 )
    detect 8-1792.0-0.1 41713944.60  406798.93  (   -3.73 )
      base 8-1792.0-0.7 27155647.40  536253.57
      prot 8-1792.0-0.7 24989406.80  502734.52  (   -7.98 )
    detect 8-1792.0-0.7 25524806.40  263237.87  (   -6.01 )
      base 8-1792.0-1.3 3260372.80  137907.92
      prot 8-1792.0-1.3 2879187.80   63597.26  (  -11.69 )
    detect 8-1792.0-1.3 2892962.20   33229.13  (  -11.27 )
      base 8-2048.0-0.1 50583989.80  710121.48
      prot 8-2048.0-0.1 49599984.40  228782.42  (   -1.95 )
    detect 8-2048.0-0.1 50578596.00  660971.66  (   -0.01 )
      base 8-2048.0-0.7 33765479.60  812659.55
      prot 8-2048.0-0.7 30767021.20  462907.24  (   -8.88 )
    detect 8-2048.0-0.7 32213068.80  211884.24  (   -4.60 )
      base 8-2048.0-1.3 3941675.80   28436.45
      prot 8-2048.0-1.3 3538742.40   76856.08  (  -10.22 )
    detect 8-2048.0-1.3 3579397.80   58630.95  (   -9.19 )

As we can see, all the cases show improvement.  Especially, test case with
zipf distribution 1.3 show more improvements.  It means that if there is a
hot/cold tendency in anon pages, this patchset works better.

This patch (of 6):

Current implementation of LRU management for anonymous page has some
problems.  Most important one is that it doesn't protect the workingset,
that is, pages on the active LRU list.  Although, this problem will be
fixed in the following patchset, the preparation is required and this
patch does it.

What following patch does is to implement workingset protection.  After
the following patchset, newly created or swap-in pages will start their
lifetime on the inactive list.  If inactive list is too small, there is
not enough chance to be referenced and the page cannot become the
workingset.

In order to provide the newly anonymous or swap-in pages enough chance to
be referenced again, this patch makes active/inactive LRU ratio as 1:1.

This is just a temporary measure.  Later patch in the series introduces
workingset detection for anonymous LRU that will be used to better decide
if pages should start on the active and inactive list.  Afterwards this
patch is effectively reverted.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Link: http://lkml.kernel.org/r/1595490560-15117-1-git-send-email-iamjoonsoo.kim@lge.com
Link: http://lkml.kernel.org/r/1595490560-15117-2-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/hugetlb: add mempolicy check in the reservation routine
Muchun Song [Wed, 12 Aug 2020 01:30:32 +0000 (18:30 -0700)]
mm/hugetlb: add mempolicy check in the reservation routine

In the reservation routine, we only check whether the cpuset meets the
memory allocation requirements.  But we ignore the mempolicy of MPOL_BIND
case.  If someone mmap hugetlb succeeds, but the subsequent memory
allocation may fail due to mempolicy restrictions and receives the SIGBUS
signal.  This can be reproduced by the follow steps.

 1) Compile the test case.
    cd tools/testing/selftests/vm/
    gcc map_hugetlb.c -o map_hugetlb

 2) Pre-allocate huge pages. Suppose there are 2 numa nodes in the
    system. Each node will pre-allocate one huge page.
    echo 2 > /proc/sys/vm/nr_hugepages

 3) Run test case(mmap 4MB). We receive the SIGBUS signal.
    numactl --membind=3D0 ./map_hugetlb 4

With this patch applied, the mmap will fail in the step 3) and throw
"mmap: Cannot allocate memory".

[akpm@linux-foundation.org: include sched.h for `current']

Reported-by: Jianchao Guo <guojianchao@bytedance.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michel Lespinasse <walken@google.com>
Cc: Baoquan He <bhe@redhat.com>
Link: http://lkml.kernel.org/r/20200728034938.14993-1-songmuchun@bytedance.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agokselftests: cgroup: add perpcu memory accounting test
Roman Gushchin [Wed, 12 Aug 2020 01:30:29 +0000 (18:30 -0700)]
kselftests: cgroup: add perpcu memory accounting test

Add a simple test to check the percpu memory accounting.  The test creates
a cgroup tree with 1000 child cgroups and checks values of memory.current
and memory.stat::percpu.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200608230819.832349-6-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm: memcg: charge memcg percpu memory to the parent cgroup
Roman Gushchin [Wed, 12 Aug 2020 01:30:25 +0000 (18:30 -0700)]
mm: memcg: charge memcg percpu memory to the parent cgroup

Memory cgroups are using large chunks of percpu memory to store vmstat
data.  Yet this memory is not accounted at all, so in the case when there
are many (dying) cgroups, it's not exactly clear where all the memory is.

Because the size of memory cgroup internal structures can dramatically
exceed the size of object or page which is pinning it in the memory, it's
not a good idea to simply ignore it.  It actually breaks the isolation
between cgroups.

Let's account the consumed percpu memory to the parent cgroup.

[guro@fb.com: add WARN_ON_ONCE()s, per Johannes]
Link: http://lkml.kernel.org/r/20200811170611.GB1507044@carbon.DHCP.thefacebook.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-5-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm: memcg/percpu: per-memcg percpu memory statistics
Roman Gushchin [Wed, 12 Aug 2020 01:30:21 +0000 (18:30 -0700)]
mm: memcg/percpu: per-memcg percpu memory statistics

Percpu memory can represent a noticeable chunk of the total memory
consumption, especially on big machines with many CPUs.  Let's track
percpu memory usage for each memcg and display it in memory.stat.

A percpu allocation is usually scattered over multiple pages (and nodes),
and can be significantly smaller than a page.  So let's add a byte-sized
counter on the memcg level: MEMCG_PERCPU_B.  Byte-sized vmstat infra
created for slabs can be perfectly reused for percpu case.

[guro@fb.com: v3]
Link: http://lkml.kernel.org/r/20200623184515.4132564-4-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200608230819.832349-4-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm: memcg/percpu: account percpu memory to memory cgroups
Roman Gushchin [Wed, 12 Aug 2020 01:30:17 +0000 (18:30 -0700)]
mm: memcg/percpu: account percpu memory to memory cgroups

Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.

As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user.  Also, some cgroup internals (e.g.  memory controller
statistics) can be quite large.  On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.

So the lack of memcg accounting is creating a breach in the memory
isolation.  Similar to the slab memory, percpu memory should be accounted
by default.

To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow.  Let's introduce two types of percpu
chunks: root and memcg.  What makes memcg chunks different is an
additional space allocated to store memcg membership information.  If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.

To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.

[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agopercpu: return number of released bytes from pcpu_free_area()
Roman Gushchin [Wed, 12 Aug 2020 01:30:14 +0000 (18:30 -0700)]
percpu: return number of released bytes from pcpu_free_area()

Patch series "mm: memcg accounting of percpu memory", v3.

This patchset adds percpu memory accounting to memory cgroups.  It's based
on the rework of the slab controller and reuses concepts and features
introduced for the per-object slab accounting.

Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.

As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user.  Also, some cgroup internals (e.g.  memory controller
statistics) can be quite large.  On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.

So the lack of memcg accounting is creating a breach in the memory
isolation.  Similar to the slab memory, percpu memory should be accounted
by default.

Percpu allocations by their nature are scattered over multiple pages, so
they can't be tracked on the per-page basis.  So the per-object tracking
introduced by the new slab controller is reused.

The patchset implements charging of percpu allocations, adds memcg-level
statistics, enables accounting for percpu allocations made by memory
cgroup internals and provides some basic tests.

To implement the accounting of percpu memory without a significant memory
and performance overhead the following approach is used: all accounted
allocations are placed into a separate percpu chunk (or chunks).  These
chunks are similar to default chunks, except that they do have an attached
vector of pointers to obj_cgroup objects, which is big enough to save a
pointer for each allocated object.  On the allocation, if the allocation
has to be accounted (__GFP_ACCOUNT is passed, the allocating process
belongs to a non-root memory cgroup, etc), the memory cgroup is getting
charged and if the maximum limit is not exceeded the allocation is
performed using a memcg-aware chunk.  Otherwise -ENOMEM is returned or the
allocation is forced over the limit, depending on gfp (as any other kernel
memory allocation).  The memory cgroup information is saved in the
obj_cgroup vector at the corresponding offset.  On the release time the
memcg information is restored from the vector and the cgroup is getting
uncharged.  Unaccounted allocations (at this point the absolute majority
of all percpu allocations) are performed in the old way, so no additional
overhead is expected.

To avoid pinning dying memory cgroups by outstanding allocations,
obj_cgroup API is used instead of directly saving memory cgroup pointers.
obj_cgroup is basically a pointer to a memory cgroup with a standalone
reference counter.  The trick is that it can be atomically swapped to
point at the parent cgroup, so that the original memory cgroup can be
released prior to all objects, which has been charged to it.  Because all
charges and statistics are fully recursive, it's perfectly correct to
uncharge the parent cgroup instead.  This scheme is used in the slab
memory accounting, and percpu memory can just follow the scheme.

This patch (of 5):

To implement accounting of percpu memory we need the information about the
size of freed object.  Return it from pcpu_free_area().

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
cC: Michal Koutnýutny@suse.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-1-guro@fb.com
Link: http://lkml.kernel.org/r/20200608230819.832349-1-guro@fb.com
Link: http://lkml.kernel.org/r/20200608230819.832349-2-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agofix breakage in do_rmdir()
Al Viro [Wed, 12 Aug 2020 04:15:18 +0000 (05:15 +0100)]
fix breakage in do_rmdir()

syzbot reported and bisected a use-after-free due to the recent init
cleanups.

The putname() should happen only after we'd *not* branched to retry,
same as it's done in do_unlinkat().

Reported-by: syzbot+bbeb1c88016c7db4aa24@syzkaller.appspotmail.com
Fixes: e24ab0ef689d "fs: push the getname from do_rmdir into the callers"
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoparisc: Implement __smp_store_release and __smp_load_acquire barriers
John David Anglin [Thu, 30 Jul 2020 12:59:12 +0000 (08:59 -0400)]
parisc: Implement __smp_store_release and __smp_load_acquire barriers

This patch implements the __smp_store_release and __smp_load_acquire barriers
using ordered stores and loads.  This avoids the sync instruction present in
the generic implementation.

Cc: <stable@vger.kernel.org> # 4.14+
Signed-off-by: Dave Anglin <dave.anglin@bell.net>
Signed-off-by: Helge Deller <deller@gmx.de>
3 years agortc: pcf2127: fix alarm handling
Alexandre Belloni [Wed, 12 Aug 2020 08:51:14 +0000 (10:51 +0200)]
rtc: pcf2127: fix alarm handling

Fix multiple issues when handling alarms:
 - Use threaded interrupt to avoid scheduling when atomic
 - Stop matching on week day as it may not be set correctly
 - Avoid parsing the DT interrupt and use what is provided by the i2c or
   spi subsystem
 - Avoid returning IRQ_NONE in case of error in the interrupt handler
 - Never write WDTF as specified in the datasheet
 - Set uie_unsupported, as for the pcf85063, setting alarms every seconds
   is not working correctly and confuses the RTC.

Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Link: https://lore.kernel.org/r/20200812085114.474903-1-alexandre.belloni@bootlin.com
3 years agortc: pcf2127: add alarm support
Liam Beguin [Tue, 30 Jun 2020 02:42:11 +0000 (22:42 -0400)]
rtc: pcf2127: add alarm support

Add alarm support for the pcf2127 RTC chip family.
Tested on pca2129.

Signed-off-by: Liam Beguin <lvb@xiphos.com>
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Reviewed-by: Bruno Thomsen <bruno.thomsen@gmail.com>
Link: https://lore.kernel.org/r/20200630024211.12782-3-liambeguin@gmail.com
3 years agortc: pcf2127: add pca2129 device id
Liam Beguin [Tue, 30 Jun 2020 02:42:10 +0000 (22:42 -0400)]
rtc: pcf2127: add pca2129 device id

The PCA2129 is the automotive grade version of the PCF2129.
add it to the list of compatibles.

Signed-off-by: Liam Beguin <lvb@xiphos.com>
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Reviewed-by: Bruno Thomsen <bruno.thomsen@gmail.com>
Link: https://lore.kernel.org/r/20200630024211.12782-2-liambeguin@gmail.com
3 years agodrm/ttm: revert "drm/ttm: make TT creation purely optional v3"
Christian König [Wed, 12 Aug 2020 03:03:49 +0000 (13:03 +1000)]
drm/ttm: revert "drm/ttm: make TT creation purely optional v3"

This reverts commit 2ddef17678bc2ea1d20517dd2b4ed4aa967ffa8b.

As it turned out VMWGFX needs a much wider audit to fix this.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200811092400.188124-1-christian.koenig@amd.com
3 years agoMerge branch 'vmwgfx-next-5.9' of git://people.freedesktop.org/~sroland/linux into...
Dave Airlie [Wed, 12 Aug 2020 02:58:19 +0000 (12:58 +1000)]
Merge branch 'vmwgfx-next-5.9' of git://people.freedesktop.org/~sroland/linux into drm-next

The drm_mode_config_reset patches are very important fixing a recently
introduced kernel crash, the others fix various older issues which are
a bit less serious in practice.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: "Roland Scheidegger (VMware)" <rscheidegger.oss@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200812005941.19465-1-rscheidegger.oss@gmail.com
3 years agoparisc: mask out enable and reserved bits from sba imask
Sven Schnelle [Tue, 11 Aug 2020 16:19:19 +0000 (18:19 +0200)]
parisc: mask out enable and reserved bits from sba imask

When using kexec the SBA IOMMU IBASE might still have the RE
bit set. This triggers a WARN_ON when trying to write back the
IBASE register later, and it also makes some mask calculations fail.

Cc: <stable@vger.kernel.org>
Signed-off-by: Sven Schnelle <svens@stackframe.org>
Signed-off-by: Helge Deller <deller@gmx.de>
3 years agoMerge tag 'tag-chrome-platform-for-v5.9' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Wed, 12 Aug 2020 00:28:32 +0000 (17:28 -0700)]
Merge tag 'tag-chrome-platform-for-v5.9' of git://git./linux/kernel/git/chrome-platform/linux

Pull chrome platform updates from Benson Leung:
 "cros_ec_typec:

   - Add support for switch control and alternate modes to the Chrome EC
     Type C port driver

   - Add basic suspend/resume support

  sensorhub:

   - Fix timestamp overflow issue

   - Fix legacy timestamp spreading on Nami systems

  cros_ec_proto:

   - After removing all users of, stop exporting cros_ec_cmd_xfer

   - Check for missing EC_CMD_HOST_EVENT_GET_WAKE_MASK and ignore
     wakeups on old ECs

  misc:

   - Documentation warning cleanup

   - Fix double unlock issue in ishtp"

* tag 'tag-chrome-platform-for-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux: (21 commits)
  platform/chrome: cros_ec_proto: check for missing EC_CMD_HOST_EVENT_GET_WAKE_MASK
  platform/chrome: cros_ec_proto: ignore unnecessary wakeups on old ECs
  platform/chrome: cros_ec_sensorhub: Simplify legacy timestamp spreading
  platform/chrome: cros_ec_proto: Do not export cros_ec_cmd_xfer()
  platform/chrome: cros_ec_typec: Unregister partner on error
  platform/chrome: cros_ec_sensorhub: Fix EC timestamp overflow
  platform/chrome: cros_ec_typec: Add PM support
  platform/chrome: cros_ec_typec: Use workqueue for port update
  platform/chrome: cros_ec_typec: Add a dependency on USB_ROLE_SWITCH
  platform/chrome: cros_ec_ishtp: Fix a double-unlock issue
  platform/chrome: cros_ec_rpmsg: Document missing struct parameters
  platform/chrome: cros_ec_spi: Document missing function parameters
  platform/chrome: cros_ec_typec: Add TBT compat support
  platform/chrome: cros_ec: Add TBT pd_ctrl fields
  platform/chrome: cros_ec_typec: Make configure_mux static
  platform/chrome: cros_ec_typec: Support DP alt mode
  platform/chrome: cros_ec_typec: Add USB mux control
  platform/chrome: cros_ec_typec: Register PD CTRL cmd v2
  platform/chrome: cros_ec: Update mux state bits
  platform/chrome: cros_ec_typec: Register Type C switches
  ...

3 years agoMerge tag 'for-linus-5.9-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubca...
Linus Torvalds [Wed, 12 Aug 2020 00:08:11 +0000 (17:08 -0700)]
Merge tag 'for-linus-5.9-ofs1' of git://git./linux/kernel/git/hubcap/linux

Pull orangefs updates from Mike Marshall:
 "A fix and a cleanup...

  The fix: Al Viro pointed out that I had broken some acl functionality
  with one of my previous patches.

  And the cleanup: Jing Xiangfeng found and removed a needless variable
  assignment"

* tag 'for-linus-5.9-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: remove unnecessary assignment to variable ret
  orangefs: posix acl fix...

3 years agoMerge tag 'zonefs-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal...
Linus Torvalds [Wed, 12 Aug 2020 00:05:55 +0000 (17:05 -0700)]
Merge tag 'zonefs-5.9-rc1' of git://git./linux/kernel/git/dlemoal/zonefs

Pull zonefs update from Damien Le Moal:
 "A single change for this cycle adding support for zone capacities
  smaller than the zone size, from Johannes"

* tag 'zonefs-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: update documentation to reflect zone size vs capacity
  zonefs: add zone-capacity support

3 years agoMerge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Linus Torvalds [Tue, 11 Aug 2020 21:43:12 +0000 (14:43 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux

Pull arm64 fix from Catalin Marinas:
 "Fix recordmcount build failure on non-arm64 (caused by an arm64
  patch)"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  recordmcount: Fix build failure on non arm64

3 years agoMerge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Linus Torvalds [Tue, 11 Aug 2020 21:34:17 +0000 (14:34 -0700)]
Merge tag 'for_linus' of git://git./linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:

 - IRQ bypass support for vdpa and IFC

 - MLX5 vdpa driver

 - Endianness fixes for virtio drivers

 - Misc other fixes

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (71 commits)
  vdpa/mlx5: fix up endian-ness for mtu
  vdpa: Fix pointer math bug in vdpasim_get_config()
  vdpa/mlx5: Fix pointer math in mlx5_vdpa_get_config()
  vdpa/mlx5: fix memory allocation failure checks
  vdpa/mlx5: Fix uninitialised variable in core/mr.c
  vdpa_sim: init iommu lock
  virtio_config: fix up warnings on parisc
  vdpa/mlx5: Add VDPA driver for supported mlx5 devices
  vdpa/mlx5: Add shared memory registration code
  vdpa/mlx5: Add support library for mlx5 VDPA implementation
  vdpa/mlx5: Add hardware descriptive header file
  vdpa: Modify get_vq_state() to return error code
  net/vdpa: Use struct for set/get vq state
  vdpa: remove hard coded virtq num
  vdpasim: support batch updating
  vhost-vdpa: support IOTLB batching hints
  vhost-vdpa: support get/set backend features
  vhost: generialize backend features setting/getting
  vhost-vdpa: refine ioctl pre-processing
  vDPA: dont change vq irq after DRIVER_OK
  ...

3 years agoMerge tag 'for-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux...
Linus Torvalds [Tue, 11 Aug 2020 21:30:36 +0000 (14:30 -0700)]
Merge tag 'for-v5.9' of git://git./linux/kernel/git/jmorris/linux-security

Pull security subsystem updates from James Morris:
 "A couple of minor documentation updates only for this release"

* tag 'for-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  LSM: drop duplicated words in header file comments
  Replace HTTP links with HTTPS ones: security

3 years agoMerge tag 'iommu-updates-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/joro...
Linus Torvalds [Tue, 11 Aug 2020 21:13:24 +0000 (14:13 -0700)]
Merge tag 'iommu-updates-v5.9' of git://git./linux/kernel/git/joro/iommu

Pull iommu updates from Joerg Roedel:

 - Remove of the dev->archdata.iommu (or similar) pointers from most
   architectures. Only Sparc is left, but this is private to Sparc as
   their drivers don't use the IOMMU-API.

 - ARM-SMMU updates from Will Deacon:

     - Support for SMMU-500 implementation in Marvell Armada-AP806 SoC

     - Support for SMMU-500 implementation in NVIDIA Tegra194 SoC

     - DT compatible string updates

     - Remove unused IOMMU_SYS_CACHE_ONLY flag

     - Move ARM-SMMU drivers into their own subdirectory

 - Intel VT-d updates from Lu Baolu:

     - Misc tweaks and fixes for vSVA

     - Report/response page request events

     - Cleanups

 - Move the Kconfig and Makefile bits for the AMD and Intel drivers into
   their respective subdirectory.

 - MT6779 IOMMU Support

 - Support for new chipsets in the Renesas IOMMU driver

 - Other misc cleanups and fixes (e.g. to improve compile test coverage)

* tag 'iommu-updates-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (77 commits)
  iommu/amd: Move Kconfig and Makefile bits down into amd directory
  iommu/vt-d: Move Kconfig and Makefile bits down into intel directory
  iommu/arm-smmu: Move Arm SMMU drivers into their own subdirectory
  iommu/vt-d: Skip TE disabling on quirky gfx dedicated iommu
  iommu: Add gfp parameter to io_pgtable_ops->map()
  iommu: Mark __iommu_map_sg() as static
  iommu/vt-d: Rename intel-pasid.h to pasid.h
  iommu/vt-d: Add page response ops support
  iommu/vt-d: Report page request faults for guest SVA
  iommu/vt-d: Add a helper to get svm and sdev for pasid
  iommu/vt-d: Refactor device_to_iommu() helper
  iommu/vt-d: Disable multiple GPASID-dev bind
  iommu/vt-d: Warn on out-of-range invalidation address
  iommu/vt-d: Fix devTLB flush for vSVA
  iommu/vt-d: Handle non-page aligned address
  iommu/vt-d: Fix PASID devTLB invalidation
  iommu/vt-d: Remove global page support in devTLB flush
  iommu/vt-d: Enforce PASID devTLB field mask
  iommu: Make some functions static
  iommu/amd: Remove double zero check
  ...

3 years agoMerge tag 'backlight-next-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/lee...
Linus Torvalds [Tue, 11 Aug 2020 20:48:02 +0000 (13:48 -0700)]
Merge tag 'backlight-next-5.9' of git://git./linux/kernel/git/lee/backlight

Pull backlight updates from Lee Jones:
 "Core Framework:
   - Trivial: Code refactoring
   - New API backlight_is_blank()
   - New API backlight_get_brightness()
   - Additional/reworked documentation
   - Remove 'extern' labels from prototypes
   - Drop backlight_put()
   - Staticify of_find_backlight()

  Driver Removal:
   - Removal of unused OT200 driver
   - Removal of unused Generic Backlight driver

  Fix-ups
   - Bunch of W=1 warning fixes
   - Convert to GPIO descriptors; sky81452
   - Move platform data handling into driver; sky81452
   - Remove superfluous code; lms501kf03
   - Many instances of using new APIs"

* tag 'backlight-next-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight: (34 commits)
  video: backlight: cr_bllcd: Remove unused variable 'intensity'
  backlight: backlight: Make of_find_backlight static
  backlight: backlight: Drop backlight_put()
  backlight: Use backlight_get_brightness() throughout
  backlight: jornada720_bl: Introduce backlight_is_blank()
  backlight: gpio_backlight: Simplify update_status()
  backlight: cr_bllcd: Introduce gpio-backlight semantics
  backlight: as3711_bl: Simplify update_status
  backlight: backlight: Introduce backlight_get_brightness()
  doc-rst: Wire-up Backlight kernel-doc documentation
  backlight: backlight: Add overview and update existing doc
  backlight: backlight: Drop extern from prototypes
  backlight: generic_bl: Remove this driver as it is unused
  backlight: backlight: Document enums in backlight.h
  backlight: backlight: Document inline functions in backlight.h
  backlight: backlight: Improve backlight_device documentation
  backlight: backlight: Improve backlight_properties documentation
  backlight: backlight: Improve backlight_ops documentation
  backlight: backlight: Add backlight_is_blank()
  backlight: backlight: Refactor fb_notifier_callback()
  ...

3 years agoMerge tag 'hwlock-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson...
Linus Torvalds [Tue, 11 Aug 2020 18:53:34 +0000 (11:53 -0700)]
Merge tag 'hwlock-v5.9' of git://git./linux/kernel/git/andersson/remoteproc

Pull hwspinlock updates from Bjorn Andersson:
 "This introduces a new DT binding format to describe the Qualcomm
  hardware mutex block and deprecates the old, invalid, one.

  It also cleans up the Kconfig slightly"

* tag 'hwlock-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc:
  dt-bindings: hwlock: qcom: Remove invalid binding
  hwspinlock: qcom: Allow mmio usage in addition to syscon
  dt-bindings: hwlock: qcom: Allow device on mmio bus
  dt-bindings: hwlock: qcom: Migrate binding to YAML
  hwspinlock: Simplify Kconfig

3 years agoMerge tag 'rproc-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson...
Linus Torvalds [Tue, 11 Aug 2020 18:17:45 +0000 (11:17 -0700)]
Merge tag 'rproc-v5.9' of git://git./linux/kernel/git/andersson/remoteproc

Pull remoteproc updates from Bjorn Andersson:
 "This introduces a new "detached" state for remote processors that are
  deemed to be running at the time Linux boots and the infrastructure
  for "attaching" to these. It then introduces the support for
  performing this operation for the STM32 platform.

  The coredump functionality is moved out from the core file and gains
  support for an optional mode where the recovery phase awaits the
  notification from devcoredump that the dump should be released. This
  allows userspace to grab the coredump in scenarios where vmalloc space
  is too low for creating a complete copy of the coredump before handing
  this to devcoredump.

  A new character device based interface is introduced to allow tying
  the stoppage of a remote processor to the termination of a user space
  process. This is useful in situations when such process provides
  crucial resources/operations for the firmware running on the remote
  processor.

  The Texas Instrument K3 driver gains support for the C66x and C71x
  DSPs.

  Qualcomm remoteprocs gains support for stashing relocation information
  in IMEM, to aid post mortem debugging and the crash notification
  mechanism is generalized to be reusable in cases where loosely coupled
  drivers needs to know about the status of a remote processor. One such
  example is the IPA hardware block, which is jointly owned with the
  modem and migrated to this improved interface.

  It also introduces a number of bug fixes and debug improvements for
  the Qualcomm modem remoteproc driver.

  And it cleans up the inconsistent interface for remoteproc drivers to
  implement power management"

* tag 'rproc-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc: (56 commits)
  remoteproc: core: Register the character device interface
  remoteproc: Add remoteproc character device interface
  remoteproc: kill IPA notify code
  net: ipa: new notification infrastructure
  remoteproc: k3-dsp: Add support for C71x DSPs
  dt-bindings: remoteproc: k3-dsp: Update bindings for C71x DSPs
  remoteproc: k3-dsp: Add support for L2RAM loading on C66x DSPs
  remoteproc: k3-dsp: Add a remoteproc driver of K3 C66x DSPs
  dt-bindings: remoteproc: Add bindings for C66x DSPs on TI K3 SoCs
  remoteproc: k3: Add TI-SCI processor control helper functions
  remoteproc: Introduce rproc_of_parse_firmware() helper
  dt-bindings: arm: keystone: Add common TI SCI bindings
  remoteproc: qcom_q6v5_mss: Remove redundant running state
  remoteproc: qcom: q6v5: Update running state before requesting stop
  remoteproc: qcom_q6v5_mss: Add modem debug policy support
  remoteproc: qcom_q6v5_mss: Validate modem blob firmware size before load
  remoteproc: qcom_q6v5_mss: Validate MBA firmware size before load
  rpmsg: update documentation
  remoteproc: qcom_q6v5_mss: Add MBA log extraction support
  remoteproc: Add coredump debugfs entry
  ...

3 years agoMerge tag 'rpmsg-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson...
Linus Torvalds [Tue, 11 Aug 2020 18:13:17 +0000 (11:13 -0700)]
Merge tag 'rpmsg-v5.9' of git://git./linux/kernel/git/andersson/remoteproc

Pull rpmsg update from Bjorn Andersson:
 "This ensures that rpmsg uses little-endian, per the VirtIO 1.0
  specification"

* tag 'rpmsg-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc:
  rpmsg: virtio: add endianness conversions

3 years agoMerge tag 'libnvdimm-for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm...
Linus Torvalds [Tue, 11 Aug 2020 17:59:19 +0000 (10:59 -0700)]
Merge tag 'libnvdimm-for-5.9' of git://git./linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updayes from Vishal Verma:
 "You'd normally receive this pull request from Dan Williams, but he's
  busy watching a newborn (Congrats Dan!), so I'm watching libnvdimm
  this cycle.

  This adds a new feature in libnvdimm - 'Runtime Firmware Activation',
  and a few small cleanups and fixes in libnvdimm and DAX. I'd
  originally intended to make separate topic-based pull requests - one
  for libnvdimm, and one for DAX, but some of the DAX material fell out
  since it wasn't quite ready.

  Summary:

   - add 'Runtime Firmware Activation' support for NVDIMMs that
     advertise the relevant capability

   - misc libnvdimm and DAX cleanups"

* tag 'libnvdimm-for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm/security: ensure sysfs poll thread woke up and fetch updated attr
  libnvdimm/security: the 'security' attr never show 'overwrite' state
  libnvdimm/security: fix a typo
  ACPI: NFIT: Fix ARS zero-sized allocation
  dax: Fix incorrect argument passed to xas_set_err()
  ACPI: NFIT: Add runtime firmware activate support
  PM, libnvdimm: Add runtime firmware activation support
  libnvdimm: Convert to DEVICE_ATTR_ADMIN_RO()
  drivers/dax: Expand lock scope to cover the use of addresses
  fs/dax: Remove unused size parameter
  dax: print error message by pr_info() in __generic_fsdax_supported()
  driver-core: Introduce DEVICE_ATTR_ADMIN_{RO,RW}
  tools/testing/nvdimm: Emulate firmware activation commands
  tools/testing/nvdimm: Prepare nfit_ctl_test() for ND_CMD_CALL emulation
  tools/testing/nvdimm: Add command debug messages
  tools/testing/nvdimm: Cleanup dimm index passing
  ACPI: NFIT: Define runtime firmware activation commands
  ACPI: NFIT: Move bus_dsm_mask out of generic nvdimm_bus_descriptor
  libnvdimm: Validate command family indices

3 years agoparisc: Whitespace cleanups in atomic.h
Helge Deller [Sun, 14 Jun 2020 08:50:42 +0000 (10:50 +0200)]
parisc: Whitespace cleanups in atomic.h

Fix whitespace indenting and drop trailing backslashes.

Cc: <stable@vger.kernel.org> # 4.19+
Signed-off-by: Helge Deller <deller@gmx.de>
3 years agoKVM: MIPS/VZ: Fix build error caused by 'kvm_run' cleanup
Xingxing Su [Tue, 11 Aug 2020 11:18:37 +0000 (07:18 -0400)]
KVM: MIPS/VZ: Fix build error caused by 'kvm_run' cleanup

Commit c34b26b98caca48ec9ee9 ("KVM: MIPS: clean up redundant 'kvm_run'
parameters") remove the 'kvm_run' parameter in kvm_vz_gpsi_lwc2.

The following build error:

arch/mips/kvm/vz.c: In function 'kvm_trap_vz_handle_gpsi':
arch/mips/kvm/vz.c:1243:43: error: 'run' undeclared (first use in this function)
   er = kvm_vz_gpsi_lwc2(inst, opc, cause, run, vcpu);
                                           ^~~
arch/mips/kvm/vz.c:1243:43: note: each undeclared identifier is reported only
once for each function it appears in
scripts/Makefile.build:283: recipe for target 'arch/mips/kvm/vz.o' failed
make[2]: *** [arch/mips/kvm/vz.o] Error 1
scripts/Makefile.build:500: recipe for target 'arch/mips/kvm' failed
make[1]: *** [arch/mips/kvm] Error 2
Makefile:1785: recipe for target 'arch/mips' failed
make: *** [arch/mips] Error 2

Signed-off-by: Xingxing Su <suxingxing@loongson.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 years agoparisc/kernel/ftrace: Remove function callback casts
Oscar Carter [Sat, 4 Jul 2020 15:50:57 +0000 (17:50 +0200)]
parisc/kernel/ftrace: Remove function callback casts

In an effort to enable -Wcast-function-type in the top-level Makefile to
support Control Flow Integrity builds, remove all the function callback
casts.

Co-developed-by: Helge Deller <deller@gmx.de>
Signed-off-by: Oscar Carter <oscar.carter@gmx.com>
Signed-off-by: Helge Deller <deller@gmx.de>
3 years agosections.h: dereference_function_descriptor() returns void pointer
Helge Deller [Sun, 9 Aug 2020 05:30:06 +0000 (07:30 +0200)]
sections.h: dereference_function_descriptor() returns void pointer

The function dereference_function_descriptor() takes on hppa64, ppc64
and ia64 a pointer to a function descriptor and returns a (void) pointer
to the dereferenced function.
To make cross-arch coding easier, on all other architectures the
dereference_function_descriptor() macro should return a void pointer
too.

Signed-off-by: Helge Deller <deller@gmx.de>
3 years agozonefs: update documentation to reflect zone size vs capacity
Johannes Thumshirn [Tue, 21 Jul 2020 12:10:27 +0000 (21:10 +0900)]
zonefs: update documentation to reflect zone size vs capacity

Update the zonefs documentation to reflect the difference between a zone's
size and it's capacity.

The maximum file size in zonefs is the zones capacity, for ZBC and ZAC
based devices, which do not have a separate zone capacity, the zone
capacity is equal to the zone size.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
3 years agozonefs: add zone-capacity support
Johannes Thumshirn [Tue, 21 Jul 2020 12:10:26 +0000 (21:10 +0900)]
zonefs: add zone-capacity support

In the zoned storage model, the sectors within a zone are typically all
writeable. With the introduction of the Zoned Namespace (ZNS) Command
Set in the NVM Express organization, the model was extended to have a
specific writeable capacity.

This zone capacity can be less than the overall zone size for a NVMe ZNS
device or null_blk in zoned-mode. For other ZBC/ZAC devices the zone
capacity is always equal to the zone size.

Use the zone capacity field instead from blk_zone for determining the
maximum inode size and inode blocks in zonefs.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
3 years agoMerge tag 'amd-drm-fixes-5.9-2020-08-07' of git://people.freedesktop.org/~agd5f/linux...
Dave Airlie [Tue, 11 Aug 2020 03:08:44 +0000 (13:08 +1000)]
Merge tag 'amd-drm-fixes-5.9-2020-08-07' of git://people.freedesktop.org/~agd5f/linux into drm-next

amd-drm-fixes-5.9-2020-08-07:

amdgpu:
- Re-add spelling typo fix
- Sienna Cichlid fixes
- Navy Flounder fixes
- DC fixes
- SMU i2c fix
- Power fixes

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexdeucher@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200807222843.3909-1-alexander.deucher@amd.com