platform/kernel/linux-rpi.git
3 months ago  pgtable: add pmd_index() fallback  sandbox/dwlee08/v5.4-tizen-mg-lru
Marek Szyprowski [Thu, 18 Jan 2024 14:41:31 +0000 (15:41 +0100)]
pgtable: add pmd_index() fallback

It is needed for ARM 32bit non-LPAE builds.
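
A minimal sketch of the kind of generic fallback this adds, assuming the
conventional definition used by the mainline generic pgtable headers (the
actual patch may differ):

    #ifndef pmd_index
    static inline unsigned long pmd_index(unsigned long address)
    {
            /* index of the PMD entry covering 'address' within its PMD table */
            return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }
    #define pmd_index pmd_index
    #endif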

Change-Id: Ic38faf98600a7f9cb46142c1bd39649df3b14692
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months ago  tizen: enable MG-LRU in tizen_bcm2711_defconfig
Marek Szyprowski [Wed, 6 Dec 2023 12:25:55 +0000 (13:25 +0100)]
tizen: enable MG-LRU in tizen_bcm2711_defconfig

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Change-Id: Ibc0ca8782288234908e2bf71364fe5cdb4925b6e

3 months ago  ANDROID: MGLRU: Avoid reactivation of anon pages on swap full
Kalesh Singh [Tue, 14 Feb 2023 22:17:57 +0000 (14:17 -0800)]
ANDROID: MGLRU: Avoid reactivation of anon pages on swap full

This is a combination of the following two commits:

ANDROID: MGLRU: Don't skip anon reclaim if swap low

MGLRU tries to avoid doing unnecessary anon reclaim work if swap is
low by checking if the available swap is less than the MIN_BATCH_SIZE
(256kB). This can lead to unintended consequences where PSI pressure is
less and LMKD doesn't wake up in time to avoid file cache thrashing.

Remove this check to preserve the old behavior. This can be improved
later on once we have a low swap notification event from the kernel.

Bug: 268574308
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
ANDROID: MGLRU: Avoid reactivation of anon pages on swap full

Avoid anon reclaim if swap is full since this reactivates the
pages.
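
An illustrative sketch of the combined behavior (names and placement are
assumptions, not the literal Android change): the swappiness helper bails
out only when swap is completely full, instead of when it is merely low:

    static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
    {
            struct mem_cgroup *memcg = lruvec_memcg(lruvec);

            if (!sc->may_swap)
                    return 0;

            /* old: skip anon reclaim when free swap < MIN_BATCH_SIZE      */
            /* new: skip anon reclaim only when no free swap is left, since */
            /*      reclaimed anon pages would just be reactivated          */
            if (mem_cgroup_get_nr_swap_pages(memcg) <= 0)
                    return 0;

            return mem_cgroup_swappiness(memcg);
    }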

Bug: 261619133
Bug: 276521916
Change-Id: I197513d50458f2b752957152ef67fe285c7bbaf4
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
[backport of the commit 0491ec319e940b68a1e94145a14e1c65915d7a45 from android13-5.15 branch]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months ago  BACKPORT: mm: Multi-gen LRU: remove wait_event_killable()
Kalesh Singh [Thu, 13 Apr 2023 21:43:26 +0000 (14:43 -0700)]
BACKPORT: mm: Multi-gen LRU: remove wait_event_killable()

Android 14 and later default to MGLRU [1] and field telemetry showed
occasional long tail latency (>100ms) in the reclaim path.

Tracing revealed priority inversion in the reclaim path.  In
try_to_inc_max_seq(), when high priority tasks were blocked on
wait_event_killable(), the preemption of the low priority task to call
wake_up_all() caused those high priority tasks to wait longer than
necessary.  In general, this problem is not different from others of its
kind, e.g., one caused by mutex_lock().  However, it is specific to MGLRU
because it introduced the new wait queue lruvec->mm_state.wait.

The purpose of this new wait queue is to avoid the thundering herd
problem.  If many direct reclaimers rush into try_to_inc_max_seq(), only
one can succeed, i.e., the one to wake up the rest, and the rest who
failed might cause premature OOM kills if they do not wait.  So far there
is no evidence supporting this scenario, based on how often the wait has
been hit.  And this begs the question how useful the wait queue is in
practice.

Based on Minchan's recommendation, which is in line with his commit
6d4675e60135 ("mm: don't be stuck to rmap lock on reclaim path") and the
rest of the MGLRU code which also uses trylock when possible, remove the
wait queue.

[1] https://android-review.googlesource.com/q/I7ed7fbfd6ef9ce10053347528125dd98c39e50bf
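
A generic sketch of the resulting pattern, with made-up names (this is not
the actual try_to_inc_max_seq() code): a racing walker detects that someone
else already advanced the sequence and simply returns, instead of sleeping
on a shared wait queue:

    struct demo_state {
            spinlock_t lock;
            unsigned long seq;
    };

    static bool demo_inc_seq(struct demo_state *state, unsigned long seq)
    {
            bool success = false;

            spin_lock(&state->lock);
            if (seq == state->seq) {
                    state->seq++;           /* this caller does the aging work */
                    success = true;
            }
            spin_unlock(&state->lock);

            /* losers return immediately instead of wait_event_killable() */
            return success;
    }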

Link: https://lkml.kernel.org/r/20230413214326.2147568-1-kaleshsingh@google.com
Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
Change-Id: I911f3968fd1adb25171279cc5b6f48ccb7efc8de
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Reported-by: Wei Wang <wvw@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 7f63cf2d9b9bbe7b90f808927558a66ff737d399)
Bug: 277906484
[ Kalesh Singh - Fix conflicts in mm/vmscan.c; rename force_scan ->
full_scan]
[ Kalesh Singh - Fix up ABI breakages in include/linux/mmzone.h ]
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
[backport of the commit 10d315f8354581a6b07b0e8b4eabf68524063983 from android13-5.15 branch]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months ago  BACKPORT: mm: multi-gen LRU: move lru_gen_add_mm() out of IRQ-off region
Sebastian Andrzej Siewior [Wed, 26 Oct 2022 13:48:30 +0000 (15:48 +0200)]
BACKPORT: mm: multi-gen LRU: move lru_gen_add_mm() out of IRQ-off region

lru_gen_add_mm() has been added within an IRQ-off region in the commit
mentioned below.  The other invocations of lru_gen_add_mm() are not within
an IRQ-off region.

The invocation within IRQ-off region is problematic on PREEMPT_RT because
the function is using a spin_lock_t which must not be used within
IRQ-disabled regions.

The other invocations of lru_gen_add_mm() occur while
task_struct::alloc_lock is acquired.  Move lru_gen_add_mm() after
interrupts are enabled and before task_unlock().
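
An ordering sketch of the fix (simplified from the exec path, not the
literal diff):

    task_lock(tsk);
    local_irq_disable();
    /* ... install and activate the new mm ... */
    local_irq_enable();

    /* takes a spin_lock_t, so it must run with IRQs enabled on PREEMPT_RT,
     * but still under task_lock() like the other call sites */
    lru_gen_add_mm(mm);
    task_unlock(tsk);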

Bug: 254441685
Link: https://lkml.kernel.org/r/20221026134830.711887-1-bigeasy@linutronix.de
Fixes: bd74fdaea1460 ("mm: multi-gen LRU: support page table walks")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit dda1c41a07b4a4c3f99b5b28c1e8c485205fe860)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I0ab2d5811f6c8df16a4deb58ab6aa9717eac565f
[backport of the commit ad8cc978ccc17a0fd1149ebd76407b629907a727 from android13-5.15 branch]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months ago  BACKPORT: mm: multi-gen LRU: design doc
Yu Zhao [Sun, 18 Sep 2022 08:00:11 +0000 (02:00 -0600)]
BACKPORT: mm: multi-gen LRU: design doc

Add a design doc.

Link: https://lkml.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com
Change-Id: I78c0956fb9f3b34d9faf96ecdef31fe0c5aeb184
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 8be976a0937a18118424dd2505925081d9192fd5)
[ Kalesh Singh - Fix trivial conflicts in Documentation/vm/index.rst ]
Bug: 249601646
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
[backport of the commit 383505860c229dcd6ef4dc5cc458dddead48e7c9 from android13-5.15 branch]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months ago  UPSTREAM: mm: multi-gen LRU: admin guide
Yu Zhao [Sun, 18 Sep 2022 08:00:10 +0000 (02:00 -0600)]
UPSTREAM: mm: multi-gen LRU: admin guide

Add an admin guide.

Link: https://lkml.kernel.org/r/20220918080010.2920238-14-yuzhao@google.com
Change-Id: Ia4dba47e8231eda4f0e76fb8969df7291a9bfe7c
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 07017acb06012d250fb68930e809257e6694d324)
Bug: 249601646
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
[backport of the commit 3fa3e8ad5d5f1201852234fedccdbd33bfe731b7 from android13-5.15 branch]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months ago  BACKPORT: mm: multi-gen LRU: debugfs interface
Yu Zhao [Sun, 18 Sep 2022 08:00:09 +0000 (02:00 -0600)]
BACKPORT: mm: multi-gen LRU: debugfs interface

Add /sys/kernel/debug/lru_gen for working set estimation and proactive
reclaim.  These techniques are commonly used to optimize job scheduling
(bin packing) in data centers [1][2].

Compared with the page table-based approach and the PFN-based
approach, this lruvec-based approach has the following advantages:
1. It offers better choices because it is aware of memcgs, NUMA nodes,
   shared mappings and unmapped page cache.
2. It is more scalable because it is O(nr_hot_pages), whereas the
   PFN-based approach is O(nr_total_pages).

Add /sys/kernel/debug/lru_gen_full for debugging.

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731

Link: https://lkml.kernel.org/r/20220918080010.2920238-13-yuzhao@google.com
Change-Id: I2d07e743e7ed139fb2ba16d5f2c1a5f32f238ccc
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit d6c3af7d8a2ba5602c28841248c551a712ac50f5)
Bug: 249601646
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
[backport of the commit a95784fdac0bffbac611fe90910dbc58a00ddd23 from android13-5.15 branch]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months ago  UPSTREAM: mm: multi-gen LRU: thrashing prevention
Yu Zhao [Sun, 18 Sep 2022 08:00:08 +0000 (02:00 -0600)]
UPSTREAM: mm: multi-gen LRU: thrashing prevention

Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as
requested by many desktop users [1].

When set to value N, it prevents the working set of N milliseconds from
getting evicted.  The OOM killer is triggered if this working set cannot
be kept in memory.  Based on the average human detectable lag (~100ms),
N=1000 usually eliminates intolerable lags due to thrashing.  Larger
values like N=3000 make lags less noticeable at the risk of premature OOM
kills.

Compared with the size-based approach [2], this time-based approach
has the following advantages:

1. It is easier to configure because it is agnostic to applications
   and memory sizes.
2. It is more reliable because it is directly wired to the OOM killer.

[1] https://lore.kernel.org/r/Ydza%2FzXKY9ATRoh6@google.com/
[2] https://lore.kernel.org/r/20101028191523.GA14972@google.com/

Link: https://lkml.kernel.org/r/20220918080010.2920238-12-yuzhao@google.com
Change-Id: I52c14154a55c3e131d6e43fc623b3030cfa435ec
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 1332a809d95a4fc763cabe5ecb6d4fb6a6d941b2)
Bug: 249601646
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
[backport of the commit dd4f2bd6c074cbd906bbbb1de6c950b0feb782a4 from android13-5.15 branch]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months ago  BACKPORT: mm: multi-gen LRU: kill switch
Yu Zhao [Sun, 18 Sep 2022 08:00:07 +0000 (02:00 -0600)]
BACKPORT: mm: multi-gen LRU: kill switch

Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
can be disabled include:
  0x0001: the multi-gen LRU core
  0x0002: walking page tables, when arch_has_hw_pte_young() returns
          true
  0x0004: clearing the accessed bit in non-leaf PMD entries, when
          CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
  [yYnN]: apply to all the components above
E.g.,
  echo y >/sys/kernel/mm/lru_gen/enabled
  cat /sys/kernel/mm/lru_gen/enabled
  0x0007
  echo 5 >/sys/kernel/mm/lru_gen/enabled
  cat /sys/kernel/mm/lru_gen/enabled
  0x0005

NB: the page table walks happen on the scale of seconds under heavy memory
pressure, in which case the mmap_lock contention is a lesser concern,
compared with the LRU lock contention and the I/O congestion.  So far the
only well-known case of the mmap_lock contention happens on Android, due
to Scudo [1] which allocates several thousand VMAs for merely a few
hundred MBs.  The SPF and the Maple Tree also have provided their own
assessments [2][3].  However, if walking page tables does worsen the
mmap_lock contention, the kill switch can be used to disable it.  In this
case the multi-gen LRU will suffer a minor performance degradation, as
shown previously.

Clearing the accessed bit in non-leaf PMD entries can also be disabled,
since this behavior was not tested on x86 varieties other than Intel and
AMD.

[1] https://source.android.com/devices/tech/debug/scudo
[2] https://lore.kernel.org/r/20220128131006.67712-1-michel@lespinasse.org/
[3] https://lore.kernel.org/r/20220426150616.3937571-1-Liam.Howlett@oracle.com/

Link: https://lkml.kernel.org/r/20220918080010.2920238-11-yuzhao@google.com
Change-Id: If3116e6698cc6967b6992c2017962fac6c2d3a11
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 354ed597442952fb680c9cafc7e4eb8a76f9514c)
Bug: 249601646
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
[backport of the commit 94d1a38c479812f97af8a865c2d8b6298898ea18 from android13-5.15 branch]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months ago  BACKPORT: mm: multi-gen LRU: optimize multiple memcgs
Yu Zhao [Sun, 18 Sep 2022 08:00:06 +0000 (02:00 -0600)]
BACKPORT: mm: multi-gen LRU: optimize multiple memcgs

When multiple memcgs are available, it is possible to use generations as a
frame of reference to make better choices and improve overall performance
under global memory pressure.  This patch adds a basic optimization to
select memcgs that can drop single-use unmapped clean pages first.  Doing
so reduces the chance of going into the aging path or swapping, which can
be costly.

A typical example that benefits from this optimization is a server running
mixed types of workloads, e.g., heavy anon workload in one memcg and heavy
buffered I/O workload in the other.

Though this optimization can be applied to both kswapd and direct reclaim,
it is only added to kswapd to keep the patchset manageable.  Later
improvements may cover the direct reclaim path.

While ensuring certain fairness to all eligible memcgs, proportional scans
of individual memcgs also require proper backoff to avoid overshooting
their aggregate reclaim target by too much.  Otherwise it can cause high
direct reclaim latency.  The conditions for backoff are:

1. At low priorities, for direct reclaim, if aging fairness or direct
   reclaim latency is at risk, i.e., aging one memcg multiple times or
   swapping after the target is met.
2. At high priorities, for global reclaim, if per-zone free pages are
   above respective watermarks.

Server benchmark results:
  Mixed workloads:
    fio (buffered I/O): +[19, 21]%
                IOPS         BW
      patch1-8: 1880k        7343MiB/s
      patch1-9: 2252k        8796MiB/s

    memcached (anon): +[119, 123]%
                Ops/sec      KB/sec
      patch1-8: 862768.65    33514.68
      patch1-9: 1911022.12   74234.54

  Mixed workloads:
    fio (buffered I/O): +[75, 77]%
                IOPS         BW
      5.19-rc1: 1279k        4996MiB/s
      patch1-9: 2252k        8796MiB/s

    memcached (anon): +[13, 15]%
                Ops/sec      KB/sec
      5.19-rc1: 1673524.04   65008.87
      patch1-9: 1911022.12   74234.54

  Configurations:
    (changes since patch 6)

    cat mixed.sh
    modprobe brd rd_nr=2 rd_size=56623104

    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    mkfs.ext4 /dev/ram1
    mount -t ext4 /dev/ram1 /mnt

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=90m --group_reporting &
    pid=$!

    sleep 200

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

    kill -INT $pid
    wait

Client benchmark results:
  no change (CONFIG_MEMCG=n)

Link: https://lkml.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com
Change-Id: I590780e6381d577d800d4c4551a702047fc31cc7
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit f76c83378851f8e70f032848c4e61203f39480e4)
Bug: 249601646
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
[backport of the commit 8726e22e86042de4fea4544854f11e60be0c20c7 from android13-5.15 branch]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months ago  BACKPORT: mm: multi-gen LRU: support page table walks
Yu Zhao [Sun, 18 Sep 2022 08:00:05 +0000 (02:00 -0600)]
BACKPORT: mm: multi-gen LRU: support page table walks

To further exploit spatial locality, the aging prefers to walk page tables
to search for young PTEs and promote hot pages.  A kill switch will be
added in the next patch to disable this behavior.  When disabled, the
aging relies on the rmap only.

NB: this behavior has nothing in common with the page table scanning in the
2.4 kernel [1], which searches page tables for old PTEs, adds cold pages
to swapcache and unmaps them.

To avoid confusion, the term "iteration" specifically means the traversal
of an entire mm_struct list; the term "walk" will be applied to page
tables and the rmap, as usual.

An mm_struct list is maintained for each memcg, and an mm_struct follows
its owner task to the new memcg when this task is migrated.  Given an
lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot pages
before it increments max_seq.

When multiple page table walkers iterate the same list, each of them gets
a unique mm_struct; therefore they can run concurrently.  Page table
walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
pages it left in the previous memcg will not be promoted when its current
memcg is under reclaim.  Similarly, page table walkers will not promote
pages from nodes other than the one under reclaim.

This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
   page table walkers can skip processes that have been sleeping since
   the last iteration.
2. It uses generational Bloom filters to record populated branches so
   that page table walkers can reduce their search space based on the
   query results, e.g., to skip page tables containing mostly holes or
   misplaced pages (a sketch of such a filter follows this list).
3. It takes advantage of the accessed bit in non-leaf PMD entries when
   CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
   spanning multiple VMAs. IOW, it finishes all the VMAs within the
   range of the same PMD table before it returns to a PGD table. This
   improves the cache performance for workloads that have large
   numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
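
A hedged sketch of the generational Bloom filter idea from item 2 above
(sizes, hashing and names are illustrative, not the exact mm/vmscan.c
code): two hash bits per PMD table are set in the filter of the current
generation, and walkers test the other filter to decide whether a branch
is worth descending into.

    #define DEMO_BLOOM_SHIFT        15

    static unsigned long demo_filters[2][BITS_TO_LONGS(1 << DEMO_BLOOM_SHIFT)];

    static void demo_bloom_keys(void *item, int *key1, int *key2)
    {
            *key1 = hash_ptr(item, DEMO_BLOOM_SHIFT);
            *key2 = hash_ptr(item + 1, DEMO_BLOOM_SHIFT);
    }

    /* record 'item' in the filter of generation 'gen' */
    static void demo_bloom_set(unsigned long gen, void *item)
    {
            int key1, key2;

            demo_bloom_keys(item, &key1, &key2);
            set_bit(key1, demo_filters[gen & 1]);
            set_bit(key2, demo_filters[gen & 1]);
    }

    /* false means "definitely not populated": the walker can skip it */
    static bool demo_bloom_test(unsigned long gen, void *item)
    {
            int key1, key2;

            demo_bloom_keys(item, &key1, &key2);
            return test_bit(key1, demo_filters[gen & 1]) &&
                   test_bit(key2, demo_filters[gen & 1]);
    }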

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[8, 10]%
                Ops/sec      KB/sec
      patch1-7: 1147696.57   44640.29
      patch1-8: 1245274.91   48435.66

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-7
      48.16%  lzo1x_1_do_compress (real work)
       8.20%  page_vma_mapped_walk (overhead)
       7.06%  _raw_spin_unlock_irq
       2.92%  ptep_clear_flush
       2.53%  __zram_bvec_write
       2.11%  do_raw_spin_lock
       2.02%  memmove
       1.93%  lru_gen_look_around
       1.56%  free_unref_page_list
       1.40%  memset

    patch1-8
      49.44%  lzo1x_1_do_compress (real work)
       6.19%  page_vma_mapped_walk (overhead)
       5.97%  _raw_spin_unlock_irq
       3.13%  get_pfn_page
       2.85%  ptep_clear_flush
       2.42%  __zram_bvec_write
       2.08%  do_raw_spin_lock
       1.92%  memmove
       1.44%  alloc_zspage
       1.36%  memset

  Configurations:
    no change

Thanks to the following developers for their efforts [3].
  kernel test robot <lkp@intel.com>

[1] https://lwn.net/Articles/23732/
[2] https://llvm.org/docs/ScudoHardenedAllocator.html
[3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/

Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
Change-Id: I7ed3daf288e664e15bfd34991a77467a19a4e39a
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit bd74fdaea146029e4fa12c6de89adbe0779348a9)
[ Resolve conflicts in include/linux/memcontrol.h,
  include/linux/mm_types.h ]
Bug: 249601646
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
[backport of the commit 35e2163024497e8140fdfdf97c59f50a67d32092 from android13-5.15 branch]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months ago  BACKPORT: mm: multi-gen LRU: exploit locality in rmap
Yu Zhao [Sun, 18 Sep 2022 08:00:04 +0000 (02:00 -0600)]
BACKPORT: mm: multi-gen LRU: exploit locality in rmap

Searching the rmap for PTEs mapping each page on an LRU list (to test and
clear the accessed bit) can be expensive because pages from different VMAs
(PA space) are not cache friendly to the rmap (VA space).  For workloads
mostly using mapped pages, searching the rmap can incur the highest CPU
cost in the reclaim path.

This patch exploits spatial locality to reduce the trips into the rmap.
When shrink_page_list() walks the rmap and finds a young PTE, a new
function lru_gen_look_around() scans at most BITS_PER_LONG-1 adjacent
PTEs.  On finding another young PTE, it clears the accessed bit and
updates the gen counter of the page mapped by this PTE to
(max_seq%MAX_NR_GENS)+1.
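
A condensed sketch of the look-around idea (locking, PFN validation and
the real bookkeeping are omitted; demo_update_gen() is a hypothetical
stand-in for setting the page's gen counter to (max_seq%MAX_NR_GENS)+1):

    static void demo_look_around(struct vm_area_struct *vma, pte_t *pte,
                                 unsigned long start, unsigned long max_seq)
    {
            unsigned long addr = start;
            int i;

            /* scan the PTEs adjacent to the young one the rmap just found */
            for (i = 0; i < BITS_PER_LONG - 1; i++, addr += PAGE_SIZE) {
                    if (!pte_present(pte[i]) || !pte_young(pte[i]))
                            continue;

                    /* clear the accessed bit and promote the mapped page */
                    if (ptep_test_and_clear_young(vma, addr, pte + i))
                            demo_update_gen(pte_page(pte[i]), max_seq);
            }
    }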

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[3, 5]%
                Ops/sec      KB/sec
      patch1-6: 1106168.46   43025.04
      patch1-7: 1147696.57   44640.29

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-6
      39.03%  lzo1x_1_do_compress (real work)
      18.47%  page_vma_mapped_walk (overhead)
       6.74%  _raw_spin_unlock_irq
       3.97%  do_raw_spin_lock
       2.49%  ptep_clear_flush
       2.48%  anon_vma_interval_tree_iter_first
       1.92%  page_referenced_one
       1.88%  __zram_bvec_write
       1.48%  memmove
       1.31%  vma_interval_tree_iter_next

    patch1-7
      48.16%  lzo1x_1_do_compress (real work)
       8.20%  page_vma_mapped_walk (overhead)
       7.06%  _raw_spin_unlock_irq
       2.92%  ptep_clear_flush
       2.53%  __zram_bvec_write
       2.11%  do_raw_spin_lock
       2.02%  memmove
       1.93%  lru_gen_look_around
       1.56%  free_unref_page_list
       1.40%  memset

  Configurations:
    no change

Link: https://lkml.kernel.org/r/20220918080010.2920238-8-yuzhao@google.com
Change-Id: Iac405b6d42e2e3f632b6748368f61202c164f1ad
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 018ee47f14893d500131dfca2ff9f3ff8ebd4ed2)
Bug: 249601646
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
[backport of the commit 009d8570595a28fc61b5db16d93f01033c00a2b2 from android13-5.15 branch]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months ago  BACKPORT: mm: multi-gen LRU: minimal implementation
Yu Zhao [Sun, 18 Sep 2022 08:00:03 +0000 (02:00 -0600)]
BACKPORT: mm: multi-gen LRU: minimal implementation

To avoid confusion, the terms "promotion" and "demotion" will be applied
to the multi-gen LRU, as a new convention; the terms "activation" and
"deactivation" will be applied to the active/inactive LRU, as usual.

The aging produces young generations.  Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS.  The aging promotes
hot pages to the youngest generation when it finds them accessed through
page tables; the demotion of cold pages happens consequently when it
increments max_seq.  Promotion in the aging path does not involve any LRU
list operations, only the updates of the gen counter and
lrugen->nr_pages[]; demotion, unless as the result of the increment of
max_seq, requires LRU list operations, e.g., lru_deactivate_fn().  The
aging has the complexity O(nr_hot_pages), since it is only interested in
hot pages.

The eviction consumes old generations.  Given an lruvec, it increments
min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty.
A feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types are
available from the same generation.

The protection of pages accessed multiple times through file descriptors
takes place in the eviction path.  Each generation is divided into
multiple tiers.  A page accessed N times through file descriptors is in
tier order_base_2(N).  Tiers do not have dedicated lrugen->lists[], only
bits in page->flags.  The aforementioned feedback loop also monitors
refaults over all tiers and decides when to protect pages in which tiers
(N>1), using the first tier (N=0,1) as a baseline.  The first tier
contains single-use unmapped clean pages, which are most likely the best
choices.  In contrast to promotion in the aging path, the protection of a
page in the eviction path is achieved by moving this page to the next
generation, i.e., min_seq+1, if the feedback loop decides so.  This
approach has the following advantages:

1. It removes the cost of activation in the buffered access path by
   inferring whether pages accessed multiple times through file
   descriptors are statistically hot and thus worth protecting in the
   eviction path.
2. It takes pages accessed through page tables into account and avoids
   overprotecting pages accessed multiple times through file
   descriptors. (Pages accessed through page tables are in the first
   tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
   twice through file descriptors, when under heavy buffered I/O
   workloads.
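
A tiny illustration of the tier mapping described above (order_base_2() is
the real helper; the wrapper name is made up):

    /* N accesses through file descriptors place a page in tier order_base_2(N) */
    static unsigned int demo_tier_from_refs(unsigned int nr_refs)
    {
            return order_base_2(nr_refs);   /* N=0,1 -> tier 0; N=2 -> 1; N=3,4 -> 2 */
    }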

Server benchmark results:
  Single workload:
    fio (buffered I/O): +[30, 32]%
                IOPS         BW
      5.19-rc1: 2673k        10.2GiB/s
      patch1-6: 3491k        13.3GiB/s

  Single workload:
    memcached (anon): -[4, 6]%
                Ops/sec      KB/sec
      5.19-rc1: 1161501.04   45177.25
      patch1-6: 1106168.46   43025.04

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was only used as a ram disk to reduce the variance in the
    results.

    patch drivers/block/brd.c <<EOF
    99,100c99,100
    <  gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
    <  page = alloc_page(gfp_flags);
    ---
    >  gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
    >  page = alloc_pages_node(1, gfp_flags, 0);
    EOF

    cat >>/etc/systemd/system.conf <<EOF
    CPUAffinity=numa
    NUMAPolicy=bind
    NUMAMask=0
    EOF

    cat >>/etc/memcached.conf <<EOF
    -m 184320
    -s /var/run/memcached/memcached.sock
    -a 0766
    -t 36
    -B binary
    EOF

    cat fio.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkfs.ext4 /dev/ram0
    mount -t ext4 /dev/ram0 /mnt

    mkdir /sys/fs/cgroup/user.slice/test
    echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
    echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
    fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=5m --group_reporting

    cat memcached.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

Client benchmark results:
  kswapd profiles:
    5.19-rc1
      40.33%  page_vma_mapped_walk (overhead)
      21.80%  lzo1x_1_do_compress (real work)
       7.53%  do_raw_spin_lock
       3.95%  _raw_spin_unlock_irq
       2.52%  vma_interval_tree_iter_next
       2.37%  page_referenced_one
       2.28%  vma_interval_tree_subtree_search
       1.97%  anon_vma_interval_tree_iter_first
       1.60%  ptep_clear_flush
       1.06%  __zram_bvec_write

    patch1-6
      39.03%  lzo1x_1_do_compress (real work)
      18.47%  page_vma_mapped_walk (overhead)
       6.74%  _raw_spin_unlock_irq
       3.97%  do_raw_spin_lock
       2.49%  ptep_clear_flush
       2.48%  anon_vma_interval_tree_iter_first
       1.92%  page_referenced_one
       1.88%  __zram_bvec_write
       1.48%  memmove
       1.31%  vma_interval_tree_iter_next

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    ChromeOS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com
Change-Id: I30b26b3086ce1879b83b96eb265f8f0dcb16a1fb
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit ac35a490237446b71e3b4b782b1596967edd0aa8)
[Resolve conflicts in mm/Kconfig, mm/swap.c, mm/vmscan.c]
Bug: 249601646
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
[backport of the commit 53af55e4cca49a4b16946fae3ae6ab9ef3fef079 from android13-5.15 branch]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months ago  BACKPORT: mm: multi-gen LRU: groundwork
Yu Zhao [Sun, 18 Sep 2022 08:00:02 +0000 (02:00 -0600)]
BACKPORT: mm: multi-gen LRU: groundwork

Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in page->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.
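
A small sketch of that bookkeeping (the MAX_NR_GENS value is assumed here;
the real helpers live in include/linux/mm_inline.h):

    #define MAX_NR_GENS     4U      /* assumed for illustration */

    /* the monotonically increasing seq is truncated to index lrugen->lists[] */
    static inline unsigned long demo_lru_gen_from_seq(unsigned long seq)
    {
            return seq % MAX_NR_GENS;
    }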

There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations.  They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of working
set estimation and proactive reclaim.  These techniques are commonly used
to optimize job scheduling (bin packing) in data centers [1][2].

To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive" will
be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
   channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
   flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
   applications usually do not prepare themselves for major page
   faults like they do for blocked I/O. E.g., GUI applications
   commonly use dedicated I/O threads to avoid blocking rendering
   threads.

There are also two access patterns: one with temporal locality and the
other without.  For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
present; the latter channel is assumed to follow the latter pattern unless
outlying refaults have been observed [3][4].

The next patch will address the "outlying refaults".  Three macros, i.e.,
LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
this patch to make the entire patchset less diffy.

A page is added to the youngest generation on faulting.  The aging needs
to check the accessed bit at least twice before handing this page over to
the eviction.  The first check takes care of the accessed bit set on the
initial fault; the second check makes sure this page has not been used
since then.  This protocol, AKA second chance, requires a minimum of two
generations, hence MIN_NR_GENS.

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/

Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
Change-Id: I7b24d1e9d263e4eb2c2ee23f2eb143824fcb5201
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit ec1c86b25f4bdd9dce6436c0539d2a6ae676e1c4)
[ Resolve conflicts in mm/memory.c, mm/memcontrol.c, mm/Kconfig,
include/linux/mm_inline.h]
Bug: 249601646
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
[backport of the commit f4d4c46c3ad74657f912f112acd6f7b171f991f8 from android13-5.15 branch]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months ago  FROMLIST: mm/vmscan.c: refactor shrink_node()
Yu Zhao [Mon, 28 Sep 2020 02:49:08 +0000 (20:49 -0600)]
FROMLIST: mm/vmscan.c: refactor shrink_node()

This patch refactors shrink_node() to improve readability for the
upcoming changes to mm/vmscan.c.

Link: https://lore.kernel.org/lkml/20220309021230.721028-4-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I186f43f946de0d40d54883fb31114840fc749a57
[backport of the commit 8d47a32fa814aa5024253ba0c0a0e82c9ab2d1ae from android13-5.15 branch]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months ago  pgtable: add arch_has_hw_pte_young() fallback
Yu Zhao [Wed, 4 Aug 2021 07:31:34 +0000 (01:31 -0600)]
pgtable: add arch_has_hw_pte_young() fallback

Change-Id: Ie771d5cdd3044cbc4c9918fc1dace1ceb0358ed7
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months ago  pgtable: add pud_index() fallback
Marek Szyprowski [Tue, 9 Jan 2024 13:10:48 +0000 (14:10 +0100)]
pgtable: add pud_index() fallback

Change-Id: I4ef23d27ca85afc9933cf1dc70b8b605ee4f6591
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months ago  atomic: add try_cmpxchg() fallback
Marek Szyprowski [Tue, 19 Dec 2023 13:42:36 +0000 (14:42 +0100)]
atomic: add try_cmpxchg() fallback

Change-Id: I48abe90191d4f8e0adc2c6d191397b3b04418ff1
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months ago  mm, memcg: decouple e{low,min} state mutations from protection checks
Chris Down [Fri, 7 Aug 2020 06:22:05 +0000 (23:22 -0700)]
mm, memcg: decouple e{low,min} state mutations from protection checks

mem_cgroup_protected currently is both used to set effective low and min
and return a mem_cgroup_protection based on the result.  As a user, this
can be a little unexpected: it appears to be a simple predicate function,
if not for the big warning in the comment above about the order in which
it must be executed.

This change makes it so that we separate the state mutations from the
actual protection checks, which makes it more obvious where we need to be
careful mutating internal state, and where we are simply checking and
don't need to worry about that.

[mhocko@suse.com - don't check protection on root memcgs]
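
A hedged sketch of the resulting call pattern in the reclaim loop (close to
shrink_node_memcgs() after this change; exact signatures in the backported
code may differ):

    memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
    do {
            struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);

            /* state mutation: compute effective low/min exactly once */
            mem_cgroup_calculate_protection(target_memcg, memcg);

            /* pure predicates: no hidden side effects, no ordering trap */
            if (mem_cgroup_below_min(memcg))
                    continue;               /* hard protection */

            if (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim) {
                    sc->memcg_low_skipped = 1;
                    continue;               /* soft protection */
            }

            shrink_lruvec(lruvec, sc);
    } while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));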

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Change-Id: I916c4cb713450129c7010659e44340bd81a0959e
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Link: http://lkml.kernel.org/r/ff3f915097fcee9f6d7041c084ef92d16aaeb56a.1594638158.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit 45c7f7e1ef17f09fe70bad4b705ce43772153fd7 from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months ago  mm/thp: make is_huge_zero_pmd() safe and quicker
Hugh Dickins [Wed, 16 Jun 2021 01:23:49 +0000 (18:23 -0700)]
mm/thp: make is_huge_zero_pmd() safe and quicker

Most callers of is_huge_zero_pmd() supply a pmd already verified
present; but a few (notably zap_huge_pmd()) do not - it might be a pmd
migration entry, in which the pfn is encoded differently from a present
pmd: which might pass the is_huge_zero_pmd() test (though not on x86,
since L1TF forced us to protect against that); or perhaps even crash in
pmd_page() applied to a swap-like entry.

Make it safe by adding pmd_present() check into is_huge_zero_pmd()
itself; and make it quicker by saving huge_zero_pfn, so that
is_huge_zero_pmd() will not need to do that pmd_page() lookup each time.
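
Roughly, the check becomes (modulo backport details):

    extern unsigned long huge_zero_pfn;

    static inline bool is_huge_zero_pmd(pmd_t pmd)
    {
            /* a migration/swap entry never matches, and the cached pfn
             * avoids the pmd_page() lookup on every call */
            return READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd) && pmd_present(pmd);
    }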

__split_huge_pmd_locked() checked pmd_trans_huge() before: that worked,
but is unnecessary now that is_huge_zero_pmd() checks present.

Link: https://lkml.kernel.org/r/21ea9ca-a1f5-8b90-5e88-95fb1c49bbfa@google.com
Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
Change-Id: I4b5270605a3753233e3152829180ef86562fa2ad
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit 3b77e8c8cde581dadab9a0f1543a347e24315f11 from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months ago  include/linux/mm_inline.h: shuffle lru list addition and deletion functions
Yu Zhao [Wed, 24 Feb 2021 20:08:13 +0000 (12:08 -0800)]
include/linux/mm_inline.h: shuffle lru list addition and deletion functions

These functions will call page_lru() in the following patches.  Move them
below page_lru() to avoid the forward declaration.

Link: https://lore.kernel.org/linux-mm/20201207220949.830352-3-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20210122220600.906146-3-yuzhao@google.com
Change-Id: I5dd30e1b586878b9276ccc0f9896ab47538451ee
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit f90d8191ac864df33b1898bc7edc54eaa24e22bc from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months ago  mm/rmap: always do TTU_IGNORE_ACCESS
Shakeel Butt [Tue, 15 Dec 2020 03:06:39 +0000 (19:06 -0800)]
mm/rmap: always do TTU_IGNORE_ACCESS

Since commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic
v2"), the code to check the secondary MMU's page table access bit is
broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the
secondary MMU's page table before the check.  More specifically for those
secondary MMUs which unmap the memory in
mmu_notifier_invalidate_range_start() like kvm.

However memory reclaim is the only user of !(TTU_IGNORE_ACCESS) or the
absence of TTU_IGNORE_ACCESS and it explicitly performs the page table
access check before trying to unmap the page.  So, at worst the reclaim
will miss accesses in a very short window if we remove page table access
check in unmapping code.

There is an unintended consequence of !(TTU_IGNORE_ACCESS) for memcg
reclaim.  In memcg reclaim, page_referenced() only accounts for accesses
from processes in the same memcg as the target page, but the unmapping
code considers accesses from all processes, which decreases the
effectiveness of memcg reclaim.

The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping
code.

Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com
Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
Change-Id: If6b3d528bf5d76d2004817da96dfa430e78de42f
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit 013339df116c2ee0d796dd8bfb8f293a2030c063 from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months agomm/lru: introduce TestClearPageLRU()
Alex Shi [Tue, 15 Dec 2020 20:34:16 +0000 (12:34 -0800)]
mm/lru: introduce TestClearPageLRU()

Currently lru_lock still guards both the lru list and the page's lru bit,
which is OK.  But if we want to use a specific lruvec lock on the page, we
need to pin down the page's lruvec/memcg during locking.  Just taking the
lruvec lock first may be undermined by the page's memcg charge/migration.
To fix this problem, we clear the lru bit outside of the lock and use it
as a pin-down action to block page isolation during memcg changes.

So the standard steps of page isolation are now the following:
1, get_page();         #pin the page to avoid it being freed
2, TestClearPageLRU(); #block other isolation like memcg change
3, spin_lock on lru_lock; #serialize lru list access
4, delete page from lru list;

This patch starts with the first part: TestClearPageLRU, which combines
the PageLRU check and ClearPageLRU into one macro function,
TestClearPageLRU.  This function will be used as a page isolation
precondition to prevent other isolations elsewhere.  As a result there may
be !PageLRU pages on an lru list, so the BUG() checks need to be removed
accordingly.

There are now 2 rules for the lru bit:
1, the lru bit still indicates whether a page is on an lru list; only for
   a temporary moment (while being isolated) may a page be on an lru list
   without the lru bit set.  But a page with the lru bit set must be on an
   lru list.
2, the lru bit has to be cleared before the page is deleted from the lru
   list.

As Andrew Morton mentioned, this change would dirty the cacheline for a
page which isn't on the LRU.  But the loss is acceptable according to the
report from Rong Chen <rong.a.chen@intel.com>:
https://lore.kernel.org/lkml/20200304090301.GB5972@shao2-debian/
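
For illustration, the new isolation order can be sketched as follows (the
function and the lock parameter are illustrative only; the real changes
land in isolate_lru_page() and its callers):

    /* Hedged sketch of the ordering only, not the actual implementation. */
    static bool isolate_one_page(struct page *page, spinlock_t *lru_lock)
    {
            get_page(page);                /* 1, pin the page so it cannot be freed */
            if (!TestClearPageLRU(page)) { /* 2, block other isolation / memcg change */
                    put_page(page);
                    return false;
            }
            spin_lock_irq(lru_lock);       /* 3, serialize lru list access */
            list_del(&page->lru);          /* 4, delete the page from the lru list */
            spin_unlock_irq(lru_lock);
            return true;
    }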

Link: https://lkml.kernel.org/r/1604566549-62481-15-git-send-email-alex.shi@linux.alibaba.com
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Change-Id: I77f12be15d182f49521a2c2bfaf8fa31fda4212d
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit d25b5bd8a8f420b15517c19c4626c0c009f72a63 from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months agomm/memcg: warning on !memcg after readahead page charged
Alex Shi [Fri, 18 Dec 2020 22:01:31 +0000 (14:01 -0800)]
mm/memcg: warning on !memcg after readahead page charged

Add VM_WARN_ON_ONCE_PAGE() macro.

Since readahead pages are charged to a memcg too, in theory we don't have
to check for this exception anymore.  Before safely removing all such
checks, add a warning for the unexpected !memcg.
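
A sketch of what such a macro can look like (illustrative; the actual
definition in include/linux/mmdebug.h may differ in detail):

    #define VM_WARN_ON_ONCE_PAGE(cond, page)  ({                          \
            static bool __warned;                                          \
            int __ret_once = !!(cond);                                     \
                                                                           \
            if (unlikely(__ret_once && !__warned)) {                       \
                    __warned = true;                                       \
                    dump_page(page, "VM_WARN_ON_ONCE_PAGE(" #cond ")");    \
                    WARN_ON(1);                                            \
            }                                                              \
            unlikely(__ret_once);                                          \
    })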

Link: https://lkml.kernel.org/r/1604283436-18880-3-git-send-email-alex.shi@linux.alibaba.com
Change-Id: I0de058b8c4fd406c7d02ab837388d77c93ef778a
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit a4055888629bc0467d12d912cd7c90acdf3d9b12 from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months agomm: use self-explanatory macros rather than "2"
Yu Zhao [Fri, 16 Oct 2020 03:09:55 +0000 (20:09 -0700)]
mm: use self-explanatory macros rather than "2"

Change-Id: Ia0efe80fa2c62163b0d58a382aef1e0881cba15b
Signed-off-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Link: http://lkml.kernel.org/r/20200831175042.3527153-2-yuzhao@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit ed0173733dd468883198c3136284394320b8fad6 from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months agomm/vma: make vma_is_accessible() available for general use
Anshuman Khandual [Tue, 7 Apr 2020 03:03:47 +0000 (20:03 -0700)]
mm/vma: make vma_is_accessible() available for general use

Let's move the vma_is_accessible() helper to include/linux/mm.h, which
makes it available for general use.  While here, replace all remaining
open-coded VMA access checks with vma_is_accessible().
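
For reference, the helper itself is a trivial flag test, roughly:

    static inline bool vma_is_accessible(struct vm_area_struct *vma)
    {
            return vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
    }

Open-coded checks of the form vma->vm_flags & (VM_READ | VM_WRITE |
VM_EXEC) then simply become vma_is_accessible(vma).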

Change-Id: I475ac8df7e31c848d3c84688f7fb9e3f27c832f5
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Guo Ren <guoren@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/1582520593-30704-3-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit 3122e80efc0faf4a2accba7a46c7ed795edbfded from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months agomm/vmscan.c: clean code by removing unnecessary assignment
Mateusz Nosek [Thu, 2 Apr 2020 04:10:15 +0000 (21:10 -0700)]
mm/vmscan.c: clean code by removing unnecessary assignment

Previously 0 was assigned to variable 'lruvec_size', but the variable was
never read later.  So the assignment can be removed.

Fixes: f87bccde6a7d ("mm/vmscan: remove unused lru_pages argument")
Change-Id: Iddd0982d271ee9ab9387195b87a4f608850cadc7
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: http://lkml.kernel.org/r/20200229214022.11853-1-mateusznosek0@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit e072bff60a29c82b1536d78c412f009d1a1de4cf from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months agomm: pagewalk: add p4d_entry() and pgd_entry()
Steven Price [Tue, 4 Feb 2020 01:35:45 +0000 (17:35 -0800)]
mm: pagewalk: add p4d_entry() and pgd_entry()

pgd_entry() and pud_entry() were removed by commit 0b1fbfe50006c410
("mm/pagewalk: remove pgd_entry() and pud_entry()") because there were no
users.  We're about to add users so reintroduce them, along with
p4d_entry() as we now have 5 levels of tables.

Note that commit a00cc7d9dd93d66a ("mm, x86: add support for PUD-sized
transparent hugepages") already re-added pud_entry() but with different
semantics to the other callbacks.  This commit reverts the semantics back
to match the other callbacks.

To support hmm.c, which now uses the new semantics of pud_entry(), a new
member ('action') of struct mm_walk is added which allows the callbacks to
either descend (ACTION_SUBTREE, the default), skip (ACTION_CONTINUE) or
repeat the callback (ACTION_AGAIN).  hmm.c is then updated to call
pud_trans_huge_lock() itself and make use of the splitting/retry logic of
the core code.

After this change pud_entry() is called for all entries, not just
transparent huge pages.
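
As an illustration, a hypothetical pud_entry() callback using the new
'action' member could look like this (sketch only):

    static int dump_pud_entry(pud_t *pud, unsigned long addr,
                              unsigned long next, struct mm_walk *walk)
    {
            if (pud_leaf(*pud)) {
                    /* a leaf (huge) entry: record it and do not descend */
                    pr_debug("leaf PUD at %lx\n", addr);
                    walk->action = ACTION_CONTINUE;
            }
            /* default is ACTION_SUBTREE: descend to the PMD level;
             * ACTION_AGAIN would re-run this callback after a retry */
            return 0;
    }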

[arnd@arndb.de: fix unused variable warning]
Link: http://lkml.kernel.org/r/20200107204607.1533842-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/20191218162402.45610-12-steven.price@arm.com
Change-Id: Iefcf3d5f987b2fcc64262da127543ae11207086c
Signed-off-by: Steven Price <steven.price@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit 3afc423632a194d7d6afef34e4bb98f804cd071d from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months agomm: add generic p?d_leaf() macros
Steven Price [Tue, 4 Feb 2020 01:35:01 +0000 (17:35 -0800)]
mm: add generic p?d_leaf() macros

Patch series "Generic page walk and ptdump", v17.

Many architectures currently have a debugfs file for dumping the kernel page
tables.  Currently each architecture has to implement custom functions for
this because the details of walking the page tables used by the kernel are
different between architectures.

This series extends the capabilities of walk_page_range() so that it can
deal with the page tables of the kernel (which have no VMAs and can
contain larger huge pages than exist for user space).  A generic PTDUMP
implementation is then implemented making use of the new functionality of
walk_page_range(), and finally arm64 and x86 are switched to using it,
removing the custom table walkers.

To enable a generic page table walker to walk the unusual mappings of the
kernel we need to implement a set of functions which let us know when the
walker has reached the leaf entry.  After a suggestion from Will Deacon
I've chosen the name p?d_leaf() as this (hopefully) describes the purpose
(and is a new name so has no historic baggage).  Some architectures have
p?d_large macros but this is easily confused with "large pages".

This series ends with a generic PTDUMP implementation for arm64 and x86.

Mostly this is a clean up and there should be very little functional
change.  The exceptions are:

* arm64 PTDUMP debugfs now displays pages which aren't present (patch 22).

* arm64 has the ability to efficiently process KASAN pages (which
  previously only x86 implemented).  This means that the combination of
  KASAN and DEBUG_WX is now useable.

This patch (of 23):

Exposing the pud/pgd levels of the page tables to walk_page_range() means
we may come across the exotic large mappings that come with large areas of
contiguous memory (such as the kernel's linear map).

For architectures that don't provide all p?d_leaf() macros, provide a
generic do-nothing default that is suitable where there cannot be leaf
pages at that level.  Further patches will add implementations for
individual architectures.

The name p?d_leaf() is chosen to minimize the confusion with existing uses
of "large" pages and "huge" pages which do not necessary mean that the
entry is a leaf (for example it may be a set of contiguous entries that
only take 1 TLB slot).  For the purpose of walking the page tables we
don't need to know how it will be represented in the TLB, but we do need
to know for sure if it is a leaf of the tree.
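
The generic fallback is just a do-nothing stub, e.g. for the pmd level
(sketch; the pgd/p4d/pud variants are analogous):

    /* Overridden by architectures that can have leaf entries at this level. */
    #ifndef pmd_leaf
    #define pmd_leaf(x)     0
    #endif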

Link: http://lkml.kernel.org/r/20191218162402.45610-2-steven.price@arm.com
Change-Id: Id0b9ea4602b9694a0354f056992ce74ab91d3758
Signed-off-by: Steven Price <steven.price@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit 93fab1b22ef7c4abcbc760ce4432762b02e7f3d1 from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months agomm: vmscan: enforce inactive:active ratio at the reclaim root
Johannes Weiner [Sun, 1 Dec 2019 01:56:02 +0000 (17:56 -0800)]
mm: vmscan: enforce inactive:active ratio at the reclaim root

We split the LRU lists into an inactive and an active part to maximize
workingset protection while allowing just enough inactive cache space to
facilitate readahead and writeback for one-off file accesses (e.g.  a
linear scan through a file, or logging); or just enough inactive anon to
maintain recent reference information when reclaim needs to swap.

With cgroups and their nested LRU lists, we currently don't do this
correctly.  While recursive cgroup reclaim establishes a relative LRU
order among the pages of all involved cgroups, inactive:active size
decisions are done on a per-cgroup level.  As a result, we'll reclaim a
cgroup's workingset when it doesn't have cold pages, even when one of its
siblings has plenty of it that should be reclaimed first.

For example: workload A has 50M worth of hot cache but doesn't do any
one-off file accesses; meanwhile, parallel workload B scans files and
rarely accesses the same page twice.

If these workloads were to run in an uncgrouped system, A would be
protected from the high rate of cache faults from B.  But if they were put
in parallel cgroups for memory accounting purposes, B's fast cache fault
rate would push out the hot cache pages of A.  This is unexpected and
undesirable - the "scan resistance" of the page cache is broken.

This patch moves inactive:active size balancing decisions to the root of
reclaim - the same level where the LRU order is established.

It does this by looking at the recursive size of the inactive and the
active file sets of the cgroup subtree at the beginning of the reclaim
cycle, and then making a decision - scan or skip active pages - that
applies throughout the entire run and to every cgroup involved.
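
In rough pseudocode (illustrative only; the scan_control field and the
ratio helper below are assumed names, not the actual implementation):

    /* once, at the reclaim root, before the first lruvec is scanned */
    inactive = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
    active   = lruvec_page_state(target_lruvec, NR_ACTIVE_FILE);
    sc->deactivate_file = inactive_is_low(inactive, active);

    /* later, for every cgroup in the subtree, the same decision applies */
    if (sc->deactivate_file)
            shrink_active_list(nr_to_scan, lruvec, sc, LRU_ACTIVE_FILE);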

With that in place, in the test above, the VM will recognize that there
are plenty of inactive pages in the combined cache set of workloads A and
B and prefer the one-off cache in B over the hot pages in A.  The scan
resistance of the cache is restored.

[cai@lca.pw: fix some -Wenum-conversion warnings]
Link: http://lkml.kernel.org/r/1573848697-29262-1-git-send-email-cai@lca.pw
Link: http://lkml.kernel.org/r/20191107205334.158354-4-hannes@cmpxchg.org
Change-Id: If5fb00a0b06f616e0c096e95a21513bfa8a58e1b
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit b91ac374346ba206cfd568bb0ab830af6b205cfd from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months agomm: vmscan: detect file thrashing at the reclaim root
Johannes Weiner [Sun, 1 Dec 2019 01:55:59 +0000 (17:55 -0800)]
mm: vmscan: detect file thrashing at the reclaim root

We use refault information to determine whether the cache workingset is
stable or transitioning, and dynamically adjust the inactive:active file
LRU ratio so as to maximize protection from one-off cache during stable
periods, and minimize IO during transitions.

With cgroups and their nested LRU lists, we currently don't do this
correctly.  While recursive cgroup reclaim establishes a relative LRU
order among the pages of all involved cgroups, refaults only affect the
local LRU order in the cgroup in which they are occurring.  As a result,
cache transitions can take longer in a cgrouped system as the active pages
of sibling cgroups aren't challenged when they should be.

[ Right now, this is somewhat theoretical, because the siblings, under
  continued regular reclaim pressure, should eventually run out of
  inactive pages - and since inactive:active *size* balancing is also
  done on a cgroup-local level, we will challenge the active pages
  eventually in most cases. But the next patch will move that relative
  size enforcement to the reclaim root as well, and then this patch
  here will be necessary to propagate refault pressure to siblings. ]

This patch moves refault detection to the root of reclaim.  Instead of
remembering the cgroup owner of an evicted page, remember the cgroup that
caused the reclaim to happen.  When refaults later occur, they'll
correctly influence the cross-cgroup LRU order that reclaim follows.

I.e.  if global reclaim kicked out pages in some subgroup A/B/C, the
refault of those pages will challenge the global LRU order, and not just
the local order down inside C.

[hannes@cmpxchg.org:  use page_memcg() instead of another lookup]
Link: http://lkml.kernel.org/r/20191115160722.GA309754@cmpxchg.org
Link: http://lkml.kernel.org/r/20191107205334.158354-3-hannes@cmpxchg.org
Change-Id: Ib68cb4395227a8f49e8cf0893208a9b3a415f7c5
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit b910718a948a9120d90faf632b33ed23c70e266a from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months agomm: vmscan: move file exhaustion detection to the node level
Johannes Weiner [Sun, 1 Dec 2019 01:55:56 +0000 (17:55 -0800)]
mm: vmscan: move file exhaustion detection to the node level

Patch series "mm: fix page aging across multiple cgroups".

When applications are put into unconfigured cgroups for memory accounting
purposes, the cgrouping itself should not change the behavior of the page
reclaim code.  We expect the VM to reclaim the coldest pages in the
system.  But right now the VM can reclaim hot pages in one cgroup while
there is eligible cold cache in others.

This is because one part of the reclaim algorithm isn't truly cgroup
hierarchy aware: the inactive/active list balancing.  That is the part
that is supposed to protect hot cache data from one-off streaming IO.

The recursive cgroup reclaim scheme will scan and rotate the physical LRU
lists of each eligible cgroup at the same rate in a round-robin fashion,
thereby establishing a relative order among the pages of all those
cgroups.  However, the inactive/active balancing decisions are made
locally within each cgroup, so when a cgroup is running low on cold pages,
its hot pages will get reclaimed - even when sibling cgroups have plenty
of cold cache eligible in the same reclaim run.

For example:

   [root@ham ~]# head -n1 /proc/meminfo
   MemTotal:        1016336 kB

   [root@ham ~]# ./reclaimtest2.sh
   Establishing 50M active files in cgroup A...
   Hot pages cached: 12800/12800 workingset-a
   Linearly scanning through 18G of file data in cgroup B:
   real    0m4.269s
   user    0m0.051s
   sys     0m4.182s
   Hot pages cached: 134/12800 workingset-a

The streaming IO in B, which doesn't benefit from caching at all, pushes
out most of the workingset in A.

Solution

This series fixes the problem by elevating inactive/active balancing
decisions to the toplevel of the reclaim run.  This is either a cgroup
that hit its limit, or straight-up global reclaim if there is physical
memory pressure.  From there, it takes a recursive view of the cgroup
subtree to decide whether page deactivation is necessary.

In the test above, the VM will then recognize that cgroup B has plenty of
eligible cold cache, and that the hot pages in A can be spared:

   [root@ham ~]# ./reclaimtest2.sh
   Establishing 50M active files in cgroup A...
   Hot pages cached: 12800/12800 workingset-a
   Linearly scanning through 18G of file data in cgroup B:
   real    0m4.244s
   user    0m0.064s
   sys     0m4.177s
   Hot pages cached: 12800/12800 workingset-a

Implementation

Whether active pages can be deactivated or not is influenced by two
factors: the inactive list dropping below a minimum size relative to the
active list, and the occurrence of refaults.

This patch series first moves refault detection to the reclaim root, then
enforces the minimum inactive size based on a recursive view of the cgroup
tree's LRUs.

History

Note that this actually never worked correctly in Linux cgroups.  In the
past it worked for global reclaim and leaf limit reclaim only (we used to
have two physical LRU linkages per page), but it never worked for
intermediate limit reclaim over multiple leaf cgroups.

We're noticing this now because 1) we're putting everything into cgroups
for accounting, not just the things we want to control and 2) we're moving
away from leaf limits that invoke reclaim on individual cgroups, toward
large tree reclaim, triggered by high-level limits, or physical memory
pressure that is influenced by local protections such as memory.low and
memory.min instead.

This patch (of 3):

When file pages are lower than the watermark on a node, we try to force
scan anonymous pages to counteract the balancing algorithm's preference
for new file pages when they are likely thrashing.  This is a node-level
decision, but it's currently made each time we look at an lruvec.  This is
unnecessarily expensive and also a layering violation that makes the code
harder to understand.

Clean this up by making the check once per node and setting a flag in the
scan_control.
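
Schematically (a sketch only; the helpers below are illustrative, the real
code sums the per-zone high watermarks and file LRU counters directly):

    struct scan_control {
            /* ... existing members ... */
            unsigned int file_is_tiny:1;  /* node's file pages are dangerously low */
    };

    /* decided once per reclaim run, in shrink_node() */
    if (node_file_lru_pages(pgdat) + node_free_pages(pgdat) <=
        node_high_watermarks(pgdat))
            sc->file_is_tiny = 1;

    /* get_scan_count() then only tests sc->file_is_tiny per lruvec */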

Link: http://lkml.kernel.org/r/20191107205334.158354-2-hannes@cmpxchg.org
Change-Id: Ia0cf370aad8c58e865fae6c0e8235e06b760aca1
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit 53138cea7f398d2cdd0fa22adeec7e16093e1ebd from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months agomm: vmscan: harmonize writeback congestion tracking for nodes & memcgs
Johannes Weiner [Sun, 1 Dec 2019 01:55:52 +0000 (17:55 -0800)]
mm: vmscan: harmonize writeback congestion tracking for nodes & memcgs

The current writeback congestion tracking has separate flags for kswapd
reclaim (node level) and cgroup limit reclaim (memcg-node level).  This is
unnecessarily complicated: the lruvec is an existing abstraction layer for
that node-memcg intersection.

Introduce lruvec->flags and LRUVEC_CONGESTED.  Then track that at the
reclaim root level, which is either the NUMA node for global reclaim, or
the cgroup-node intersection for cgroup reclaim.
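
Sketch of the resulting flag handling (illustrative only):

    enum lruvec_flags {
            LRUVEC_CONGESTED,  /* reclaim keeps hitting dirty/writeback pages */
    };

    /* mark congestion at the reclaim root (node or cgroup-node intersection) */
    set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);

    /* throttle direct reclaimers that see the flag */
    if (test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
            wait_iff_congested(BLK_RW_ASYNC, HZ/10);

    /* kswapd clears it once the reclaim root is balanced again */
    clear_bit(LRUVEC_CONGESTED, &target_lruvec->flags);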

Link: http://lkml.kernel.org/r/20191022144803.302233-9-hannes@cmpxchg.org
Change-Id: Iaa10ac1d8ee7de5c6f2563ac5ff36ae9bae876a2
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit 1b05117df78e035afb5f66ef50bf8750d976ef08 from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months agomm: vmscan: split shrink_node() into node part and memcgs part
Johannes Weiner [Sun, 1 Dec 2019 01:55:49 +0000 (17:55 -0800)]
mm: vmscan: split shrink_node() into node part and memcgs part

This function is getting long and unwieldy, split out the memcg bits.

The updated shrink_node() handles the generic (node) reclaim aspects:
  - global vmpressure notifications
  - writeback and congestion throttling
  - reclaim/compaction management
  - kswapd giving up on unreclaimable nodes

It then calls a new shrink_node_memcgs() which handles the cgroup
specifics (roughly sketched below):
  - the cgroup tree traversal
  - memory.low considerations
  - per-cgroup slab shrinking callbacks
  - per-cgroup vmpressure notifications
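
A rough sketch of the new cgroup-iteration helper (illustrative, not the
exact code):

    static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
    {
            struct mem_cgroup *memcg;

            memcg = mem_cgroup_iter(sc->target_mem_cgroup, NULL, NULL);
            do {    /* cgroup tree traversal, memory.low, slab, vmpressure */
                    struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);

                    shrink_lruvec(lruvec, sc);
                    shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
                                sc->priority);
            } while ((memcg = mem_cgroup_iter(sc->target_mem_cgroup, memcg,
                                              NULL)));
    }

shrink_node() keeps the node-level concerns and simply calls the helper.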

[hannes@cmpxchg.org: rename "root" to "target_memcg", per Roman]
Link: http://lkml.kernel.org/r/20191025143640.GA386981@cmpxchg.org
Link: http://lkml.kernel.org/r/20191022144803.302233-8-hannes@cmpxchg.org
Change-Id: Ibef223d4f49740dc2ab2e86f2b7c1e529a9643e2
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit 0f6a5cff43d3bcd6aa54c9af267737249d02aa21 from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months agomm: vmscan: turn shrink_node_memcg() into shrink_lruvec()
Johannes Weiner [Sun, 1 Dec 2019 01:55:46 +0000 (17:55 -0800)]
mm: vmscan: turn shrink_node_memcg() into shrink_lruvec()

An lruvec holds LRU pages owned by a certain NUMA node and cgroup.
Instead of awkwardly passing around a combination of a pgdat and a memcg
pointer, pass down the lruvec as soon as we can look it up.

Nested callers that need to access node or cgroup properties can look them
up if necessary, but there are only a few cases.
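
The signature change is essentially (sketch):

    /* before */
    static void shrink_node_memcg(struct pglist_data *pgdat,
                                  struct mem_cgroup *memcg,
                                  struct scan_control *sc);

    /* after: the caller looks the lruvec up once and passes it down */
    static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc);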

Link: http://lkml.kernel.org/r/20191022144803.302233-7-hannes@cmpxchg.org
Change-Id: Ice444a0c2c1812301d836c9a0d38532988362ea3
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit afaf07a65ddbdd70871cc3b81463f2a8f3884b6f from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months agomm: vmscan: replace shrink_node() loop with a retry jump
Johannes Weiner [Sun, 1 Dec 2019 01:55:43 +0000 (17:55 -0800)]
mm: vmscan: replace shrink_node() loop with a retry jump

Most of the function body is inside a loop, which imposes an additional
indentation and scoping level that makes the code a bit hard to follow and
modify.

The looping only happens in case of reclaim-compaction, which isn't the
common case.  So rather than adding yet another function level to the
reclaim path and have every reclaim invocation go through a level that
only exists for one specific cornercase, use a retry goto.

Link: http://lkml.kernel.org/r/20191022144803.302233-6-hannes@cmpxchg.org
Change-Id: Iab1ead6005f957bd8bbb3fcfa48e63d0ae3592a3
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit d2af339706be318dadcbe14c8935426ff401d7b1 from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months agomm: vmscan: naming fixes: global_reclaim() and sane_reclaim()
Johannes Weiner [Sun, 1 Dec 2019 01:55:40 +0000 (17:55 -0800)]
mm: vmscan: naming fixes: global_reclaim() and sane_reclaim()

Seven years after introducing the global_reclaim() function, I still have
to double take when reading a callsite.  I don't know how others do it,
this is a terrible name.

Invert the meaning and rename it to cgroup_reclaim().

[ After all, "global reclaim" is just regular reclaim invoked from the
  page allocator. It's reclaim on behalf of a cgroup limit that is a
  special case of reclaim, and should be explicit - not the reverse. ]

sane_reclaim() isn't very descriptive either: it tests whether we can use
the regular writeback throttling - available during regular page reclaim
or cgroup2 limit reclaim - or need to use the broken
wait_on_page_writeback() method.  Use "writeback_throttling_sane()".

Link: http://lkml.kernel.org/r/20191022144803.302233-5-hannes@cmpxchg.org
Change-Id: Ie173107736712471293c47ee9692c3a98b4ca7da
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit b5ead35e7e1d3434ce436dfcb2af32820ce54589 from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months agomm: vmscan: move inactive_list_is_low() swap check to the caller
Johannes Weiner [Sun, 1 Dec 2019 01:55:37 +0000 (17:55 -0800)]
mm: vmscan: move inactive_list_is_low() swap check to the caller

inactive_list_is_low() should be about one thing: checking the ratio
between the inactive and active list.  Kitchen-sink checks like the one
for swap space make the function hard to use and its callsites hard to
modify.  Luckily, most callers already have an understanding of the swap
situation, so it's easy to clean up.

get_scan_count() has its own, memcg-aware swap check, and doesn't even get
to the inactive_list_is_low() check on the anon list when there is no swap
space available.

shrink_list() is called on the results of get_scan_count(), so that check
is redundant too.

age_active_anon() has its own totalswap_pages check right before it checks
the list proportions.

The shrink_node_memcg() site is the only one that doesn't do its own swap
check.  Add it there.

Then delete the swap check from inactive_list_is_low().
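
Roughly, the added check at that site looks like (sketch):

    /* in shrink_node_memcg(), before aging the anon list */
    if (total_swap_pages &&
        inactive_list_is_low(lruvec, false, sc, true))
            shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc, LRU_ACTIVE_ANON);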

Link: http://lkml.kernel.org/r/20191022144803.302233-4-hannes@cmpxchg.org
Change-Id: I16e03de6f88291d52fa3d6be5a999d9a4bd4f450
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit a108629149cc63cfb6fd446184e3e578e04bcfd1 from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months agomm: clean up and clarify lruvec lookup procedure
Johannes Weiner [Sun, 1 Dec 2019 01:55:34 +0000 (17:55 -0800)]
mm: clean up and clarify lruvec lookup procedure

There is a per-memcg lruvec and a NUMA node lruvec.  Which one is being
used is somewhat confusing right now, and it's easy to make mistakes -
especially when it comes to global reclaim.

How it works: when memory cgroups are enabled, we always use the
root_mem_cgroup's per-node lruvecs.  When memory cgroups are not compiled
in or disabled at runtime, we use pgdat->lruvec.

Document that in a comment.

Due to the way the reclaim code is generalized, all lookups use the
mem_cgroup_lruvec() helper function, and nobody should have to find the
right lruvec manually right now.  But to avoid future mistakes, rename the
pgdat->lruvec member to pgdat->__lruvec and delete the convenience wrapper
that suggests it's a commonly accessed member.

While in this area, swap the mem_cgroup_lruvec() argument order.  The name
suggests a memcg operation, yet it takes a pgdat first and a memcg second.
I have to double take every time I call this.  Fix that.
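
I.e. the prototype changes roughly like this (sketch):

    /* before */
    struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
                                     struct mem_cgroup *memcg);

    /* after: memcg first, like other mem_cgroup_* helpers */
    struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
                                     struct pglist_data *pgdat);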

Link: http://lkml.kernel.org/r/20191022144803.302233-3-hannes@cmpxchg.org
Change-Id: Ibe111f47294a8067dbc57683a84583a33477847b
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit 2230dc207a325970d54c3ca1efe896a57ea82499 from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months agomm: vmscan: simplify lruvec_lru_size()
Johannes Weiner [Sun, 1 Dec 2019 01:55:31 +0000 (17:55 -0800)]
mm: vmscan: simplify lruvec_lru_size()

Patch series "mm: vmscan: cgroup-related cleanups".

Here are 8 patches that clean up the reclaim code's interaction with
cgroups a bit. They're not supposed to change any behavior, just make
the implementation easier to understand and work with.

This patch (of 8):

This function currently takes the node or lruvec size and subtracts the
zones that are excluded by the classzone index of the allocation.  It uses
four different types of counters to do this.

Just add up the eligible zones.
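
The simplified helper then roughly looks like this (sketch; not guaranteed
to match the resulting code line for line):

    unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru,
                                  int zone_idx)
    {
            unsigned long size = 0;
            int zid;

            for (zid = 0; zid <= zone_idx; zid++) {
                    struct zone *zone = &lruvec_pgdat(lruvec)->node_zones[zid];

                    if (!managed_zone(zone))
                            continue;

                    if (!mem_cgroup_disabled())
                            size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
                    else
                            size += zone_page_state(zone, NR_ZONE_LRU_BASE + lru);
            }
            return size;
    }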

[cai@lca.pw: fix an undefined behavior for zone id]
Link: http://lkml.kernel.org/r/20191108204407.1435-1-cai@lca.pw
[akpm@linux-foundation.org: deal with the MAX_NR_ZONES special case. per Qian Cai]
Link: http://lkml.kernel.org/r/64E60F6F-7582-427B-8DD5-EF97B1656F5A@lca.pw
Link: http://lkml.kernel.org/r/20191022144803.302233-2-hannes@cmpxchg.org
Change-Id: I2d00ef6cd83336d904ac58cea129404c6ba5451c
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit 78a3ee9c29c881feeb716fe70218c4fd2a7c342c from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 months agomm/vmscan: remove unused lru_pages argument
Andrey Ryabinin [Sun, 1 Dec 2019 01:55:24 +0000 (17:55 -0800)]
mm/vmscan: remove unused lru_pages argument

Since 9092c71bb724 ("mm: use sc->priority for slab shrink targets") the
argument 'unsigned long *lru_pages' has been passed around with no purpose.  Remove
it.

Link: http://lkml.kernel.org/r/20190228083329.31892-4-aryabinin@virtuozzo.com
Change-Id: I8a61dd9dab063e9db2d58aafe71f5a6238a3a525
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[backport of the commit c42875e38b6fb9546f665a397b66e3a4cc2e5041 from mainline]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
3 years agoARM64: tizen_bcm2711_defconfig: Enable USB_PRINTER config 09/255809/1 old/tizen_20210330 accepted/tizen/unified/20210330.084808 submit/tizen/20210329.080334
Chanwoo Choi [Wed, 24 Mar 2021 03:45:52 +0000 (12:45 +0900)]
ARM64: tizen_bcm2711_defconfig: Enable USB_PRINTER config

Enable USB_PRINTER config to support USB printer devices.

Change-Id: I2068a283928c8f3c85d5f31b25a13300cdc52783
Signed-off-by: Chanwoo Choi <cw00.choi@samsung.com>
3 years agoARM: tizen_bcm2711_defconfig: Enable USB_PRINTER config 08/255808/1
Chanwoo Choi [Wed, 24 Mar 2021 03:44:24 +0000 (12:44 +0900)]
ARM: tizen_bcm2711_defconfig: Enable USB_PRINTER config

Enable USB_PRINTER config to support USB printer devices.

Change-Id: Ic63797d93520e4bc50175c53c7e7328b55a0f724
Signed-off-by: Chanwoo Choi <cw00.choi@samsung.com>
3 years agoARM64: tizen_bcm2711_defconfig: Enable CONFIG_CFG80211_CRDA_SUPPORT 69/255669/1
Jaehoon Chung [Mon, 22 Mar 2021 08:06:33 +0000 (17:06 +0900)]
ARM64: tizen_bcm2711_defconfig: Enable CONFIG_CFG80211_CRDA_SUPPORT

Enable CONFIG_CFG80211_CRDA_SUPPORT.

Change-Id: If60cab97a05ca8bc01633255ede56279fcc599a8
Signed-off-by: Jaehoon Chung <jh80.chung@samsung.com>
3 years agoRevert "ARM: configs: tizen_bcm2711_defconfig: Disable CONFIG_CFG80211_CRDA_SUPPORT" 68/255668/1
Jaehoon Chung [Mon, 22 Mar 2021 07:56:08 +0000 (16:56 +0900)]
Revert "ARM: configs: tizen_bcm2711_defconfig: Disable CONFIG_CFG80211_CRDA_SUPPORT"

This reverts commit 6485924bd76b70c8ed6ba334ef9b5045f4a3686d.
- Enable CONFIG_CFG80211_CRDA_SUPPORT to use country code.

Change-Id: I951b32c86611b6240991e9bb8b362c98bb56b72c
Signed-off-by: Jaehoon Chung <jh80.chung@samsung.com>
3 years agomedia: uvcvideo: Add a probe quirk to Jieli Technology USB PHY 2.0 (1224:2a25) 72/254272/1 accepted/tizen/unified/20210303.130703 submit/tizen/20210303.050006
Seung-Woo Kim [Thu, 25 Feb 2021 08:17:12 +0000 (17:17 +0900)]
media: uvcvideo: Add a probe quirk to Jieli Technology USB PHY 2.0 (1224:2a25)

Repeated video requests on a Jieli Technology USB PHY 2.0 (1224:2a25)
device cause a data stall with the below error until reconnection:
  uvcvideo: Failed to set UVC probe control : -32 (exp. 26).

To resolve the wrong state, add PROBE quirk bits.

Change-Id: I5efba6ac26d5eea70e2227f9ff9801dd5d8d4790
Signed-off-by: Seung-Woo Kim <sw0312.kim@samsung.com>
3 years agoARM/ARM64: defconfig: disable SECURITY_SMACK_NETFILTER config 09/253709/1 accepted/tizen/unified/20210217.120458 submit/tizen/20210217.042922
Jaehoon Chung [Wed, 17 Feb 2021 01:02:48 +0000 (10:02 +0900)]
ARM/ARM64: defconfig: disable SECURITY_SMACK_NETFILTER config

Disable SECURITY_SMACK_NETFILTER configuration.

Change-Id: Id95847392ac2bf53b92dae5ee2c0d4b5c1e41ece
Signed-off-by: Jaehoon Chung <jh80.chung@samsung.com>
3 years agoARM: tizen_bcm2711_defconfig: Enable OVERLAY_FS 78/252178/1 accepted/tizen/unified/20210126.141628 submit/tizen/20210125.080748
Seung-Woo Kim [Mon, 25 Jan 2021 07:24:54 +0000 (16:24 +0900)]
ARM: tizen_bcm2711_defconfig: Enable OVERLAY_FS

Enable CONFIG_OVERLAY_FS for tizen application space.

Change-Id: I46b011effa26e1f7ef9acf5f18d52bc7b54452c5
Signed-off-by: Seung-Woo Kim <sw0312.kim@samsung.com>
3 years agoARM64: tizen_bcm2711_defconfig: Enable OVERLAY_FS 77/252177/1
Seung-Woo Kim [Mon, 25 Jan 2021 07:24:03 +0000 (16:24 +0900)]
ARM64: tizen_bcm2711_defconfig: Enable OVERLAY_FS

Enable CONFIG_OVERLAY_FS for tizen application space.

Change-Id: I1b2aafaeea24b6ef6b07d3c57d542dce13c53b8f
Signed-off-by: Seung-Woo Kim <sw0312.kim@samsung.com>
3 years agousb: gadget: f_fs: Fix use-after-free for unbind with remaining io 37/252037/1 accepted/tizen/unified/20210122.123002 submit/tizen/20210122.073514
Dongwoo Lee [Fri, 22 Jan 2021 03:40:18 +0000 (12:40 +0900)]
usb: gadget: f_fs: Fix use-after-free for unbind with remaining io

If the USB connection has stalled, there can be remaining submitted io,
and unbinding f_fs while that io is outstanding results in a
use-after-free.  Fix the use-after-free by checking the endpoint after the
wait.
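
The shape of the fix in ffs_epfile_io() is roughly the following (sketch,
not the exact diff):

    if (wait_for_completion_interruptible(&done)) {
            ret = -EINTR;
            usb_ep_dequeue(ep->ep, req);
    } else if (epfile->ep != ep) {
            /* the function was unbound while we slept: ep has been freed */
            ret = -ESHUTDOWN;
    } else {
            ret = ep->status;
    }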

This fixes following kasan warning:
   BUG: KASAN: use-after-free in ffs_epfile_io+0x654/0xb58
   Read of size 4 at addr ffffffc0a44e65dc by task mtp-responder/5117
   ...
   [<ffffff900a037794>] ffs_epfile_io+0x654/0xb58
   [<ffffff900a03818c>] ffs_epfile_read_iter+0x1ac/0x3e0
   ...

   Allocated by task 3869:
   ...
    __kmalloc+0x234/0x760
    _ffs_func_bind+0x264/0x7c8
    ffs_func_bind+0xe8/0x650
    usb_add_function+0x13c/0x378
   ...
   Freed by task 3869:
   ...
    kfree+0xa4/0x750
    ffs_func_unbind+0x150/0x248
    purge_configs_funcs+0x1a0/0x310
   ...

Signed-off-by: Seung-Woo Kim <sw0312.kim@samsung.com>
[dwoo08.lee: cherry-picked from linux-amlogic commit 5dd3ffecd46f to prevent use-after-free when f_fs is unbound before all requests are over]
Signed-off-by: Dongwoo Lee <dwoo08.lee@samsung.com>
Change-Id: Idf2391c53ca0f90fc9484d725304b88fc57fa8a6

3 years agokdbus: build and package kdbus-tests 13/251413/5 accepted/tizen/unified/20210119.130216 submit/tizen/20210119.051826
Łukasz Stelmach [Tue, 22 Dec 2020 11:56:00 +0000 (12:56 +0100)]
kdbus: build and package kdbus-tests

Change-Id: Ie5a32cb744a28b242aac1f43e879306a43795bb5
Signed-off-by: Łukasz Stelmach <l.stelmach@samsung.com>
3 years agokdbus: test suite changed to common format 12/251412/5
Konrad Lipinski [Wed, 11 Sep 2019 13:42:10 +0000 (15:42 +0200)]
kdbus: test suite changed to common format

This commit adapts the kdbus test suite to use with
dbus-integration-tests.

Change-Id: Ifee21253f4e3c732b27517a0c566d4b9a569d5af
Signed-off-by: Adrian Szyndela <adrian.s@samsung.com>
Signed-off-by: Łukasz Stelmach <l.stelmach@samsung.com>
3 years agoARM/ARM64: tizen_bcm2711_defconfig: Enable KDBUS 76/251476/3
Łukasz Stelmach [Thu, 14 Jan 2021 06:02:02 +0000 (15:02 +0900)]
ARM/ARM64: tizen_bcm2711_defconfig: Enable KDBUS

Change-Id: I3f40022fa9bbb04a0cbcca1fde45a569ca2c4538
Signed-off-by: Łukasz Stelmach <l.stelmach@samsung.com>
Signed-off-by: Seung-Woo Kim <sw0312.kim@samsung.com>
3 years agokdbus: porting to 5.4 11/251411/4
Łukasz Stelmach [Mon, 21 Dec 2020 15:34:08 +0000 (16:34 +0100)]
kdbus: porting to 5.4

The following changes were made to adapt the kdbus driver to the 5.4 kernel:

- use KERNEL_DS instead of get_ds()
- remove ITER_KVEC flag
- remove 'type' argument from access_ok()
- use memfd_fcntl() instead of shmem_get_seals()
- use uapi/linux/mount.h

Fixes: 736706bee329 ("get rid of legacy 'get_ds()' function")
Fixes: aa563d7bca6e ("iov_iter: Separate type from direction and use accessor functions")
Fixes: 96d4f267e40f ("Remove 'type' argument from access_ok() function")
Fixes: 5aadc431a593 ("shmem: rename functions that are memfd-related")
Fixes: 5d752600a8c3 ("mm: restructure memfd code")
Fixes: e262e32d6bde ("vfs: Suppress MS_* flag defs within the kernel unless explicitly enabled")
Change-Id: I8d2b3db1c83bb21114554ba9eb38e5e439f5c141
Signed-off-by: Łukasz Stelmach <l.stelmach@samsung.com>
3 years agokdbus: Revert "fs: unexport poll_schedule_timeout" 10/251410/4
Łukasz Stelmach [Mon, 21 Dec 2020 11:40:27 +0000 (12:40 +0100)]
kdbus: Revert "fs: unexport poll_schedule_timeout"

This reverts commit 8f546ae1fc5ce8396827d4868c7eee1f1cc6947a.

Change-Id: I5429471eeb092c55a50e37c0b642d50d69daebc7
Signed-off-by: Łukasz Stelmach <l.stelmach@samsung.com>
3 years agokdbus: porting to 4.14 09/251409/4
Adrian Szyndela [Wed, 11 Sep 2019 13:34:33 +0000 (15:34 +0200)]
kdbus: porting to 4.14

This commit makes the kdbus driver work with kernel 4.14.
It also enables compilation of the driver if enabled in the config.

The list of changes needed to make it compile with kernel 4.14:
- Add missing includes for <linux/cred.h>
- put inode_lock()/inode_unlock() in place of explicit mutex locking
- replace CURRENT_TIME with proper call to current_time()
- replace PAGE_CACHE_* with PAGE_*
- replace GFP_TEMPORARY with GFP_KERNEL
- use kvecs for kernel memory, iovec for user memory instead of only iovec
  for both
- fix usage of task_cgroup_path()
- replace GROUP_AT usage with 'gid' field dereference
- add 0 as an argument to init_name_hash
- add 0 as flags as an argument to vfs_iter_write
- replace __vfs_read with kernel_read to allow compilation into module

Change-Id: I3dd066ab531d0d3f7082556e4d38381961998015
Signed-off-by: Adrian Szyndela <adrian.s@samsung.com>
Signed-off-by: Łukasz Stelmach <l.stelmach@samsung.com>
3 years agokdbus: the driver, original and non-working 08/251408/4
Adrian Szyndela [Fri, 9 Sep 2016 11:35:49 +0000 (13:35 +0200)]
kdbus: the driver, original and non-working

[based on commit 216823ac83c0ab89348e2ed6f66179f53626586e]

Introduce the kdbus driver again. This driver worked previously
on kernel 4.1. This is the source code taken from the working driver.
It is non-working and disabled. It is a base for porting.

The documentation is moved from Documentation to ipc/kdbus/Documentation.

The references to kdbus source code are commented out or removed in Makefiles.

Original authors of the files are those from commit 216823ac83c0ab8934.

Cherry-picked from 4.14 commit 970070c4f68f113284f86cf7b6fbd23d6b35b511.

Change-Id: Id60af5faf794fc4ae7122976621076f1021f6c38
Signed-off-by: Adrian Szyndela <adrian.s@samsung.com>
Signed-off-by: Hyotaek Shim <hyotaek.shim@samsung.com>
Signed-off-by: Łukasz Stelmach <l.stelmach@samsung.com>
3 years agospec: set CONFIG_LOCALVERSION 07/251407/4
Łukasz Stelmach [Mon, 4 Jan 2021 19:23:41 +0000 (20:23 +0100)]
spec: set CONFIG_LOCALVERSION

Set CONFIG_LOCALVERSION instead of EXTRAVERSION in the Makefile to carry
information about the platform variant.

Change-Id: If1650692162832ea86988e4adb610ce0148edde3
Signed-off-by: Łukasz Stelmach <l.stelmach@samsung.com>
3 years agospec: support gbs(1) incremental builds 06/251406/4
Łukasz Stelmach [Mon, 4 Jan 2021 13:41:58 +0000 (14:41 +0100)]
spec: support gbs(1) incremental builds

gbs(1) enables incremental builds that are significantly faster; however,
the spec files need some adjustments.

+ You can't use -n <name> with %setup
+ There is no point in making mrproper
+ use rsync(1) instead of cp(1)+find(1) to copy devel files

Other changes are:
+ use make headers_install to install headers in the buildroot
+ add _smp_mflags to make(1) to speed things up

Change-Id: Ia14fbe7acab8b83683f9e07757b4f62a7657b2a5
Signed-off-by: Łukasz Stelmach <l.stelmach@samsung.com>
3 years agoscript: adjust ccache support 05/251405/2
Łukasz Stelmach [Mon, 21 Dec 2020 12:46:37 +0000 (13:46 +0100)]
script: adjust ccache support

Change-Id: I735295fc21cf7caa29b84548a8f21449ed97800e
Signed-off-by: Łukasz Stelmach <l.stelmach@samsung.com>
3 years agoscript: increase the number of concurrent compilers 04/251404/1
Łukasz Stelmach [Wed, 13 Jan 2021 10:24:40 +0000 (11:24 +0100)]
script: increase the number of concurrent compilers

It is safe to run as many compilation processes as twice the number of
CPUs. This is the default setting of the RPM building process in Tizen.

Change-Id: Id68d9d31bc54da11c8732689645408241a517887
Signed-off-by: Łukasz Stelmach <l.stelmach@samsung.com>
3 years agoscript: use $NCPUS for arm64 builds too 03/251403/1
Łukasz Stelmach [Wed, 13 Jan 2021 10:12:04 +0000 (11:12 +0100)]
script: use $NCPUS for arm64 builds too

Fixes: b371be68152e ("script: Fix to use NCPUS dynamically")
Change-Id: I32b5c08f4b00ca24032a2f9355d308f3df825fbd
Signed-off-by: Łukasz Stelmach <l.stelmach@samsung.com>
3 years agoARM: mm: Free memblock from free_initrd_mem() 95/251095/1 accepted/tizen/unified/20210111.125453 submit/tizen/20210111.031323
Seung-Woo Kim [Thu, 7 Jan 2021 09:09:20 +0000 (18:09 +0900)]
ARM: mm: Free memblock from free_initrd_mem()

Even after free_initrd_mem(), the memblock for the initrd remains.  Free
the memblock for the initrd from free_initrd_mem().

Reported-by: Jaehoon Chung <jh80.chung@samsung.com>
Signed-off-by: Seung-Woo Kim <sw0312.kim@samsung.com>
[sw0312.kim: port mainline posted patch to 5.4.y]
Signed-off-by: Seung-Woo Kim <sw0312.kim@samsung.com>
Change-Id: I42a7461558868cdfdc559959112c81bc4c8b10c5

3 years agoARM: dts: bcm2711-rpi-4-b: Adjust CMA size due to the OOM issue 27/251027/2 accepted/tizen/unified/20210107.123346 submit/tizen/20210107.044235
Sung-hun Kim [Thu, 7 Jan 2021 04:01:01 +0000 (13:01 +0900)]
ARM: dts: bcm2711-rpi-4-b: Adjust CMA size due to the OOM issue

Since the 32-bit kernel has a limitation on the lowmem size, a big CMA
size severely affects the memory available to the kernel.  Because of
that, some test cases trigger OOM kills due to the exhaustion of
kernel-available memory, as reported by Jaehoon Chung
(jh80.chung@samsung.com).

To handle this, I adjusted the CMA size to 400MB, which still meets the
requirement for UHD resolution.

Change-Id: I189699d4ce5f2866c01edf68c8138e9278679188
Signed-off-by: Sung-hun Kim <sfoon.kim@samsung.com>
3 years agomm, page_alloc: fix core hung in free_pcppages_bulk() 24/250724/1
Charan Teja Reddy [Fri, 21 Aug 2020 00:42:27 +0000 (17:42 -0700)]
mm, page_alloc: fix core hung in free_pcppages_bulk()

commit 88e8ac11d2ea3acc003cf01bb5a38c8aa76c3cfd upstream.

The following race is observed with the repeated online, offline and a
delay between two successive online of memory blocks of movable zone.

P1: Online the first memory block in the movable zone.  The pcp struct
    values are initialized to default values, i.e., pcp->high = 0 &
    pcp->batch = 1.

P2: Allocate the pages from the movable zone.

P1: Try to online the second memory block in the movable zone; it has
    entered online_pages() but has yet to call zone_pcp_update().

P2: This process enters the exit path and tries to release its order-0
    pages to the pcp lists through free_unref_page_commit().  As
    pcp->high = 0 and pcp->count = 1, it proceeds to call
    free_pcppages_bulk().

P1: Update the pcp values; the new pcp values are, say, pcp->high = 378,
    pcp->batch = 63.

P2: Read the pcp's batch value using READ_ONCE() and pass it to
    free_pcppages_bulk(); the values passed here are batch = 63,
    count = 1.

P2: Since the number of pages on the pcp lists is less than ->batch, it
    gets stuck in the while (list_empty(list)) loop with interrupts
    disabled, thus a core hang.

Avoid this by ensuring free_pcppages_bulk() is called with the proper
count of pcp list pages.
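
The essence of the fix is to clamp the requested count to what the pcp
lists actually hold (sketch):

    static void free_pcppages_bulk(struct zone *zone, int count,
                                   struct per_cpu_pages *pcp)
    {
            /* Never ask for more pages than the pcp lists contain, otherwise
             * the while (list_empty(list)) loop below spins forever with
             * interrupts disabled. */
            count = min(pcp->count, count);

            /* existing batched draining of the pcp lists follows unchanged */
    }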

The mentioned race is somewhat easily reproducible without [1] because
pcps are not updated for the first memory block online and thus there is
enough of a race window for P2 between alloc+free and the pcp struct
values update through onlining of the second memory block.

With [1], the race still exists but it is very narrow as we update the pcp
struct values for the first memory block online itself.

This is not limited to the movable zone, it could also happen in cases
with the normal zone (e.g., hotplug to a node that only has DMA memory, or
no other memory yet).

[1]: https://patchwork.kernel.org/patch/11696389/

Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into one-list-per-migrate-type")
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: <stable@vger.kernel.org> [2.6+]
Link: http://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[sw0312.kim: cherry-pick linux-5.4.y commit 0cfb9320d00c to resolve the too-large CMA issue]
Signed-off-by: Seung-Woo Kim <sw0312.kim@samsung.com>
Change-Id: I55f884f9950a35c8d75affebe426d176cd91702b

3 years agomm: include CMA pages in lowmem_reserve at boot 23/250723/1
Doug Berger [Fri, 21 Aug 2020 00:42:24 +0000 (17:42 -0700)]
mm: include CMA pages in lowmem_reserve at boot

commit e08d3fdfe2dafa0331843f70ce1ff6c1c4900bf4 upstream.

The lowmem_reserve arrays provide a means of applying pressure against
allocations from lower zones that were targeted at higher zones.  Its
values are a function of the number of pages managed by higher zones and
are assigned by a call to the setup_per_zone_lowmem_reserve() function.

The function is initially called at boot time by the function
init_per_zone_wmark_min() and may be called later by accesses of the
/proc/sys/vm/lowmem_reserve_ratio sysctl file.

The function init_per_zone_wmark_min() was moved up from a module_init to
a core_initcall to resolve a sequencing issue with khugepaged.
Unfortunately this created a sequencing issue with CMA page accounting.

The CMA pages are added to the managed page count of a zone when
cma_init_reserved_areas() is called at boot also as a core_initcall.  This
makes it uncertain whether the CMA pages will be added to the managed page
counts of their zones before or after the call to
init_per_zone_wmark_min() as it becomes dependent on link order.  With the
current link order the pages are added to the managed count after the
lowmem_reserve arrays are initialized at boot.

This means the lowmem_reserve values at boot may be lower than the values
used later if /proc/sys/vm/lowmem_reserve_ratio is accessed even if the
ratio values are unchanged.

In many cases the difference is not significant, but for example
an ARM platform with 1GB of memory and the following memory layout

  cma: Reserved 256 MiB at 0x0000000030000000
  Zone ranges:
    DMA      [mem 0x0000000000000000-0x000000002fffffff]
    Normal   empty
    HighMem  [mem 0x0000000030000000-0x000000003fffffff]

would result in 0 lowmem_reserve for the DMA zone.  This would allow
userspace to deplete the DMA zone easily.

Funnily enough

  $ cat /proc/sys/vm/lowmem_reserve_ratio

would fix up the situation because as a side effect it forces
setup_per_zone_lowmem_reserve.

This commit breaks the link order dependency by invoking
init_per_zone_wmark_min() as a postcore_initcall so that the CMA pages
have the chance to be properly accounted in their zone(s) and allowing
the lowmem_reserve arrays to receive consistent values.
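
Concretely, the ordering change amounts to (sketch of the one-liner):

    -core_initcall(init_per_zone_wmark_min)
    +postcore_initcall(init_per_zone_wmark_min)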

Fixes: bc22af74f271 ("mm: update min_free_kbytes from khugepaged after core initialization")
Signed-off-by: Doug Berger <opendmb@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1597423766-27849-1-git-send-email-opendmb@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[sw0312.kim: cherry-pick linux-5.4.y commit 5663159e2930 to resolve the too-large CMA issue]
Signed-off-by: Seung-Woo Kim <sw0312.kim@samsung.com>
Change-Id: I59cbe93ed28f09d08dced520667b32e2fb030c21

3 years agoARM: dts: bcm2711-rpi-4-b: Fix CMA size to 512M 27/250127/1 accepted/tizen/unified/20201222.122545 submit/tizen/20201222.024625
Hoegeun Kwon [Mon, 21 Dec 2020 04:54:55 +0000 (13:54 +0900)]
ARM: dts: bcm2711-rpi-4-b: Fix CMA size to 512M

Increase the CMA size to 512M to enable 4K GL on the Tizen platform.

With the RPI4 1GB model and 4K UHD output, reclaim occurs frequently and
performance suffers.

Change-Id: Iac60a9fa0d5f76085e60cbf3db9a90585597de57
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
3 years agoRevert "ALSA: usb-audio: Improve frames size computation" 27/249927/1 accepted/tizen/6.0/unified/20210121.212556 accepted/tizen/6.0/unified/hotfix/20210121.101657 accepted/tizen/unified/20201218.124446 submit/tizen/20201218.004559 submit/tizen_6.0/20210121.055004 submit/tizen_6.0_hotfix/20210121.065852
Greg Kroah-Hartman [Tue, 7 Jul 2020 12:42:38 +0000 (14:42 +0200)]
Revert "ALSA: usb-audio: Improve frames size computation"

This reverts commit aba41867dd66939d336fdf604e4d73b805d8039f which is
commit f0bd62b64016508938df9babe47f65c2c727d25c upstream.

It causes a number of reported issues and a fix for it has not hit
Linus's tree yet.  Revert this to resolve those problems.

Cc: Alexander Tsoy <alexander@tsoy.me>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Hans de Goede <jwrdegoede@fedoraproject.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[sw0312.kim: cherry-pick linux-5.4.y commit e0ed5a36fb3a to fix specific usb-audio distortion issue]
Signed-off-by: Seung-Woo Kim <sw0312.kim@samsung.com>
Change-Id: I9e7ca39dddc28d8c29bf0488419c1fbdbddfb61a

3 years agoscript: Fix to use NCPUS dynamically 28/249728/1
Hoegeun Kwon [Mon, 14 Dec 2020 14:24:34 +0000 (23:24 +0900)]
script: Fix to use NCPUS dynamically

Determine NCPUS dynamically from the build host instead of using a fixed value.

Change-Id: Ib7b1229f387b9c91f4eb2b29cc8f3bfe207fcaaa
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@gmail.com>
3 years agostaging: vc04_services: codec: Fix component enable/disable 09/249309/1 accepted/tizen/unified/20201211.124322 submit/tizen/20201211.031336
Dave Stevenson [Thu, 13 Aug 2020 16:04:53 +0000 (17:04 +0100)]
staging: vc04_services: codec: Fix component enable/disable

start_streaming enabled the VPU component if ctx->component_enabled
was not set.
stop_streaming disabled the VPU component if both ports were
disabled. It didn't clear ctx->component_enabled.

If seeking, this meant that the component never got re-enabled,
and buffers never got processed afterwards.
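
A minimal sketch of the stop_streaming() side (illustrative; names such as
ctx->component_enabled and dev->instance come from the description above or
are assumed):

  /* Once both ports are disabled, disable the component and clear the
   * flag so a later start_streaming() re-enables it (the missing step). */
  if (ctx->component_enabled) {
          vchiq_mmal_component_disable(dev->instance, ctx->component);
          ctx->component_enabled = false;
  }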

Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.com>
[sw0312.kim: cherry-pick rpi-5.4.y commit to fix video decoding seek issue]
Ref: https://github.com/raspberrypi/linux/pull/3790
Signed-off-by: Seung-Woo Kim <sw0312.kim@samsung.com>
Change-Id: Ibf27e3ed2dc3d98d64a78b10878d6339d77f8a33

3 years agostaging: vc04_service: codec: Allow start_streaming to update the buffernum 08/249308/1
Dave Stevenson [Thu, 13 Aug 2020 16:01:27 +0000 (17:01 +0100)]
staging: vc04_service: codec: Allow start_streaming to update the buffernum

start_streaming passes a count of how many buffers have been queued
to videobuf2.

Allow this value to update the number of buffers the VPU allocates
on a port to avoid buffer recycling issues.
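
A rough sketch of the idea (field names are assumptions, not the exact hunk):

  /* Let the count of buffers queued by videobuf2 raise the number of
   * buffers the VPU allocates on the port. */
  if (count > port->current_buffer.num)
          port->current_buffer.num = count;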

Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.com>
[sw0312.kim: cherry-pick rpi-5.4.y commit to fix video decoding seek issue]
Ref: https://github.com/raspberrypi/linux/pull/3790
Signed-off-by: Seung-Woo Kim <sw0312.kim@samsung.com>
Change-Id: Iaf37f57df95789cc8987b268feab1f0fa7377849

3 years agostaging: vc04_services: codec: Fix incorrect buffer cleanup 07/249307/1
Dave Stevenson [Thu, 13 Aug 2020 15:58:18 +0000 (16:58 +0100)]
staging: vc04_services: codec: Fix incorrect buffer cleanup

The allocated input and output buffers are initialised in
buf_init and should only be cleared up in buf_cleanup.
stop_streaming was (incorrectly) cleaning up the buffers to
avoid an issue in videobuf2 that had been fixed by the orphaned
buffer support.

Remove the erroneous cleanup.

Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.com>
[sw0312.kim: cherry-pick rpi-5.4.y commit to fix video decoding seek issue]
Ref: https://github.com/raspberrypi/linux/pull/3790
Signed-off-by: Seung-Woo Kim <sw0312.kim@samsung.com>
Change-Id: I38e3a257a941753b02e448d6d5dd1c8486b648f1

3 years agoARM: tizen_bcm2711_defconfig: Fix console loglevel for logo display 44/247244/2
Hoegeun Kwon [Mon, 9 Nov 2020 07:44:46 +0000 (16:44 +0900)]
ARM: tizen_bcm2711_defconfig: Fix console loglevel for logo display

Modify the CONSOLE_LOGLEVEL_QUIET defconfig value to 3 for logo display.
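
The corresponding defconfig fragment is a single line (the arm64
tizen_bcm2711_defconfig receives the same change):

  CONFIG_CONSOLE_LOGLEVEL_QUIET=3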

Change-Id: Idae234c40304fbe8fd5f02941e53505a6bd189fa
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
3 years agoarm64: tizen_bcm2711_defconfig: Fix console loglevel for logo display 25/247225/2
Hoegeun Kwon [Mon, 9 Nov 2020 07:03:59 +0000 (16:03 +0900)]
arm64: tizen_bcm2711_defconfig: Fix console loglevel for logo display

Modify the CONSOLE_LOGLEVEL_QUIET defconfig value to 3 for logo display.

Change-Id: I5112d85dedd1e116a75af0d07c09d77182ff4f5a
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
3 years agorpi4: boot: config: Enable hdmi_force_hotplug 07/246707/1 accepted/tizen/unified/20201105.124359 submit/tizen/20201104.114006
Hoegeun Kwon [Tue, 3 Nov 2020 08:09:20 +0000 (17:09 +0900)]
rpi4: boot: config: Enable hdmi_force_hotplug

Booting stops when the board is started without an HDMI cable connected.
Enable hdmi_force_hotplug to fix this, as shown below.
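
In the Raspberry Pi firmware config.txt this is a single line:

  hdmi_force_hotplug=1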

Change-Id: I44e960a04f75c999ab69a75ad72d67311f8b70c1
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
3 years agodrm/vc4: kms: Don't disable the muxing of an active CRTC 06/246706/1
Maxime Ripard [Tue, 3 Nov 2020 08:09:17 +0000 (17:09 +0900)]
drm/vc4: kms: Don't disable the muxing of an active CRTC

The current HVS muxing code will consider the CRTCs in a given state to
setup their muxing in the HVS, and disable the other CRTCs muxes.

However, it's valid to only update a single CRTC with a state, and in this
situation we would mux out a CRTC that was enabled but left untouched by
the new state.

Fix this by setting a flag on the CRTC state when the muxing has been
changed, and only change the muxing configuration when that flag is there.
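
A minimal sketch of the commit-side check (the flag name update_muxing is an
assumption):

  /* Skip CRTCs whose muxing this atomic state did not touch, so an
   * untouched but active CRTC keeps its mux. */
  if (!vc4_crtc_state->update_muxing)
          continue;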

Fixes: 87ebcd42fb7b ("drm/vc4: crtc: Assign output to channel automatically")
Signed-off-by: Maxime Ripard <maxime@cerno.tech>
[hoegeun.kwon: Needed to fix the page flip issue with dual HDMI and to
enable the force hotplug configuration.]
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
Change-Id: I5c9fb79c9ad07caa8fe77ea724f35c955df6edcf

3 years agodrm/vc4: kms: Store the unassigned channel list in the state 05/246705/1
Maxime Ripard [Tue, 3 Nov 2020 08:09:14 +0000 (17:09 +0900)]
drm/vc4: kms: Store the unassigned channel list in the state

If a CRTC is enabled but not active, and that we're then doing a page
flip on another CRTC, drm_atomic_get_crtc_state will bring the first
CRTC state into the global state, and will make us wait for its vblank
as well, even though that might never occur.

Instead of creating the list of the free channels each time atomic_check
is called, and calling drm_atomic_get_crtc_state to retrieve the
allocated channels, let's create a private state object in the main
atomic state, and use it to store the available channels.
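
A sketch of such a private state object (names assumed; the backport note
below mentions that unassigned_channels is an unsigned long):

  /* Free HVS FIFOs tracked in a DRM private state object instead of being
   * recomputed from every CRTC state on each atomic_check. */
  struct vc4_hvs_state {
          struct drm_private_state base;
          unsigned long unassigned_channels;
  };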

Fixes: 87ebcd42fb7b ("drm/vc4: crtc: Assign output to channel automatically")
Signed-off-by: Maxime Ripard <maxime@cerno.tech>
[hoegeun.kwon: Needed to fix the page flip issue with dual HDMI and to
enable the force hotplug configuration. Use unsigned long for
unassigned_channels instead of int to fix build errors.]
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
Change-Id: Ic8bf07502f614595070e21ecc7912e671a193b09

3 years agodrm/vc4: kms: Add functions to create the state objects 04/246704/1
Maxime Ripard [Tue, 3 Nov 2020 08:09:12 +0000 (17:09 +0900)]
drm/vc4: kms: Add functions to create the state objects

We're going to add a new private state, so let's move the previous state
function creation to some functions to make further additions easier.

Signed-off-by: Maxime Ripard <maxime@cerno.tech>
[hoegeun.kwon: Needed to fix the page flip issue with dual HDMI and to
enable the force hotplug configuration.]
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
Change-Id: Ice57c56dfbdfbb81f486cba7626512cfbb12b905

3 years agodrm/vc4: kms: Rename NUM_CHANNELS 03/246703/1
Maxime Ripard [Tue, 3 Nov 2020 08:09:01 +0000 (17:09 +0900)]
drm/vc4: kms: Rename NUM_CHANNELS

The NUM_CHANNELS define has a pretty generic name and was right before the
function using it. Let's move to something that makes the hardware-specific
nature more obvious, and to a more appropriate place.

Signed-off-by: Maxime Ripard <maxime@cerno.tech>
[hoegeun.kwon: Needed to fix the page flip issue with dual HDMI and to
enable the force hotplug configuration.]
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
Change-Id: Ib94c2b3ffbc7e5c95f2c3a965919e523c68f8a12

3 years agodrm/vc4: kms: Remove useless define 02/246702/1
Maxime Ripard [Tue, 3 Nov 2020 08:08:59 +0000 (17:08 +0900)]
drm/vc4: kms: Remove useless define

NUM_OUTPUTS isn't used anymore, let's remove it.

Signed-off-by: Maxime Ripard <maxime@cerno.tech>
[hoegeun.kwon: Needed to fix the page flip issue with dual HDMI and to
enable the force hotplug configuration.]
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
Change-Id: I20b47a2da045094a06cef9e9a924f46c4dd1d4a9

3 years agoRevert "drm/vc4: kms: Don't disable the muxing of an active CRTC" 01/246701/1
Hoegeun Kwon [Tue, 3 Nov 2020 08:08:57 +0000 (17:08 +0900)]
Revert "drm/vc4: kms: Don't disable the muxing of an active CRTC"

This reverts commit 5d2fec61a25bacc49ee8e84b3c19aee1522f7289.

Revert this patch in order to apply patch version 2.

Change-Id: Ib7ff9b242dc3f208cf58f7d8f0741321b725c657
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
3 years agoRevert "drm/vc4: kms: Fix VBLANK reporting on a disabled CRTC" 00/246700/1
Hoegeun Kwon [Tue, 3 Nov 2020 08:08:54 +0000 (17:08 +0900)]
Revert "drm/vc4: kms: Fix VBLANK reporting on a disabled CRTC"

This reverts commit e805316d5d44b1f1f080fd8ae8a34b69329d940c.

Revert this patch in order to apply patch version 2.

Change-Id: If9525aaa3835ab80b8ca83271dc584f68254018a
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
3 years agomm: LKSM: bug fix for KASAN out-of-bound access error on accessing a filter 44/246344/3 accepted/tizen/unified/20201102.124307 submit/tizen/20201030.062213
Sung-hun Kim [Wed, 28 Oct 2020 10:26:31 +0000 (19:26 +0900)]
mm: LKSM: bug fix for KASAN out-of-bound access error on accessing a filter

KASAN reports out-of-bounds accesses (reported by Jaehoon Chung) on a
slab object while obtaining the next filtered address to find a
sharable page.

LKSM uses bitmap-based filters to find sharable pages efficiently. The
bug was a miscalculation of the boundary of the allocated bitmap. This
patch fixes it.
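
A hypothetical illustration of the class of fix (names such as filter->bitmap
and filter->size are assumed):

  /* The scan must be bounded by the size the bitmap was allocated with,
   * otherwise find_next_bit() walks past the end of the slab object. */
  unsigned long next = find_next_bit(filter->bitmap, filter->size, offset);

  if (next >= filter->size)
          return 0;       /* no further filtered address */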

Change-Id: If45c5ce175db067523b60f11e69e12d2bc798659
Signed-off-by: Sung-hun Kim <sfoon.kim@samsung.com>
3 years agomm: LKSM: bug fix for kernel memory leak 43/246343/2
Sung-hun Kim [Tue, 27 Oct 2020 11:48:36 +0000 (20:48 +0900)]
mm: LKSM: bug fix for kernel memory leak

For efficiency, LKSM cleans up exited processes in a batched manner when it
finishes a scanning iteration. When it finds an exited process during the
scanning iteration, it simply appends the mm_slot of the exited process to an
internal list.

On the other hand, when the KSM daemon cleans the mm_slots of exited
processes, it must also handle the regions of those processes to remove
unreferenced lksm_region objects.

Previously, most regions were maintained properly, but the regions at the
head of the exited-process list were not cleaned due to a buggy
implementation. As a result, the uncleaned objects remained as unreferenced
garbage.

The following message was detected by kmemleak (reported by Suengwoo Kim):
=========================================================================
unreferenced object 0xffffff80c7083600 (size 128):
  comm "ksm_crawld", pid 41, jiffies 4294918362 (age 95.632s)
  hex dump (first 32 bytes):
    00 37 08 c7 80 ff ff ff 60 82 19 bd 80 ff ff ff  .7......`.......
    00 35 08 c7 80 ff ff ff 00 00 00 00 00 00 00 00  .5..............
  backtrace:
    [<0000000048313958>] kmem_cache_alloc_trace+0x1e0/0x348
    [<00000000fd246822>] lksm_region_ref_append+0x48/0xf8
    [<00000000c5a818a0>] ksm_join+0x3a0/0x498
    [<00000000b2c3f36a>] lksm_prepare_full_scan+0xe8/0x390
    [<00000000013943b5>] lksm_crawl_thread+0x214/0xbf8
    [<00000000b4ce0593>] kthread+0x1b0/0x1b8
    [<000000002a3f7216>] ret_from_fork+0x10/0x18
unreferenced object 0xffffff80c7083700 (size 128):
  comm "ksm_crawld", pid 41, jiffies 4294918362 (age 95.632s)
  hex dump (first 32 bytes):
    00 39 08 c7 80 ff ff ff 00 36 08 c7 80 ff ff ff  .9.......6......
    00 35 08 c7 80 ff ff ff 00 00 00 00 00 00 00 00  .5..............
  backtrace:
    [<0000000048313958>] kmem_cache_alloc_trace+0x1e0/0x348
    [<00000000fd246822>] lksm_region_ref_append+0x48/0xf8
    [<00000000c5a818a0>] ksm_join+0x3a0/0x498
    [<00000000b2c3f36a>] lksm_prepare_full_scan+0xe8/0x390
    [<00000000013943b5>] lksm_crawl_thread+0x214/0xbf8
    [<00000000b4ce0593>] kthread+0x1b0/0x1b8
    [<000000002a3f7216>] ret_from_fork+0x10/0x18
...
=========================================================================

This patch fixes this possible kernel memory leak.

Change-Id: Ifb4963773b8803da239a1d3108c5b2690d22d531
Signed-off-by: Sung-hun Kim <sfoon.kim@samsung.com>
3 years agobrcmfmac: Fix memory leak for unpaired brcmf_{alloc/free} 63/246263/1
Seung-Woo Kim [Tue, 27 Oct 2020 10:23:56 +0000 (19:23 +0900)]
brcmfmac: Fix memory leak for unpaired brcmf_{alloc/free}

There are missing brcmf_free() calls for brcmf_alloc(). Fix the memory leak
by adding the missing brcmf_free() calls.
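
An illustrative error-path pairing (call sites and signatures are
assumptions, not the actual hunk):

  ret = brcmf_alloc(dev, settings);
  if (ret)
          goto fail;

  ret = brcmf_attach(dev);
  if (ret) {
          brcmf_free(dev);        /* previously missing on this path */
          goto fail;
  }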

Change-Id: I050398a7b828b0fb2aadbe491f353b3d3c47bdd6
Reported-by: Jaehoon Chung <jh80.chung@samsung.com>
Signed-off-by: Seung-Woo Kim <sw0312.kim@samsung.com>
3 years agodrm/vc4: drv: Add error handling for bind 86/246186/2 accepted/tizen/unified/20201028.123938 submit/tizen/20201027.005200
Hoegeun Kwon [Mon, 26 Oct 2020 11:47:31 +0000 (20:47 +0900)]
drm/vc4: drv: Add error handling for bind

If the vc4_drm bind fails, a memory leak occurs on the drm_property_create
side. Add error handling for drm_mode_config.

Change-Id: If23c0dabf01b85368871981f0cba427982922615
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
3 years agostaging: mmal-vchiq: Fix memory leak for vchi_instance 65/246165/1
Seung-Woo Kim [Mon, 26 Oct 2020 07:37:19 +0000 (16:37 +0900)]
staging: mmal-vchiq: Fix memory leak for vchi_instance

The vchi_instance is allocated with vchiq_initialise() but never
handled properly. Fix memory leak for the vchi_instance.

This fixes the kmemleak report below:
    [<000000002312557f>] kmem_cache_alloc_trace+0x1e0/0x348
    [<000000002ee5d470>] vchiq_initialise+0xd0/0x258
    [<000000009b51d8f4>] vchi_initialise+0x84/0xc8
    [<00000000a58f68c5>] vchiq_mmal_init+0xb0/0x2f0
    [<000000006c68d7bf>] bcm2835_mmal_probe+0x90/0x660
    ...

Change-Id: Ib10f194aa4058b3e39ec0a250fa36fd00daf9feb
Reported-by: Jaehoon Chung <jh80.chung@samsung.com>
Signed-off-by: Seung-Woo Kim <sw0312.kim@samsung.com>
3 years agodrm/vc4: kms: Fix VBLANK reporting on a disabled CRTC 42/245642/1
Maxime Ripard [Mon, 12 Oct 2020 03:44:25 +0000 (12:44 +0900)]
drm/vc4: kms: Fix VBLANK reporting on a disabled CRTC

If a CRTC is enabled but not active, and that we're then doing a page flip
on another CRTC, drm_atomic_get_crtc_state will bring the first CRTC state
into the global state, and will make us wait for its vblank as well, even
though that might never occur.

Fix this by considering all the enabled CRTCs: either use their new state
from the global state, or use their current state if they aren't part of the
new state being checked, and remove their assigned channel from the pool
before starting to assign channels to the CRTCs enabled by the state.
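
A minimal sketch of that selection (standard DRM atomic helpers; the
surrounding channel bookkeeping is omitted):

  /* Prefer the CRTC state in this atomic state; fall back to the CRTC's
   * current state when it is not part of the update. */
  crtc_state = drm_atomic_get_new_crtc_state(state, crtc);
  if (!crtc_state)
          crtc_state = crtc->state;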

Fixes: 87ebcd42fb7b ("drm/vc4: crtc: Assign output to channel automatically")
Signed-off-by: Maxime Ripard <maxime@cerno.tech>
[hoegeun.kwon: Needed to fix the page flip issue with dual HDMI.]
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
Change-Id: Ic8b8083a652df142a17ec0d50f6c2572494ba3c0

3 years agodrm/vc4: kms: Don't disable the muxing of an active CRTC 41/245641/1
Maxime Ripard [Thu, 8 Oct 2020 11:25:18 +0000 (13:25 +0200)]
drm/vc4: kms: Don't disable the muxing of an active CRTC

The current HVS muxing code will consider the CRTCs in a given state to
setup their muxing in the HVS, and disable the other CRTCs muxes.

However, it's valid to only update a single CRTC with a state, and in this
situation we would mux out a CRTC that was enabled but left untouched by
the new state.

Fix this by considering all the CRTCs in the state and the ones with a
state and active.

Fixes: 87ebcd42fb7b ("drm/vc4: crtc: Assign output to channel automatically")
Signed-off-by: Maxime Ripard <maxime@cerno.tech>
[hoegeun.kwon: Needed to fix the page flip issue with dual HDMI.]
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
Change-Id: Idb6e79214f52e470e9fcf5a0579873402c631eb2

3 years agodrm/vc4: kms: Fix to split vc4 and vc5 hvs pv muxing commit 40/245640/1
Hoegeun Kwon [Mon, 12 Oct 2020 12:10:58 +0000 (21:10 +0900)]
drm/vc4: kms: Fix to split vc4 and vc5 hvs pv muxing commit

In order to use dual HDMI normally, vc4 and vc5 should be handled
separately. Split the HVS PV muxing commit into vc4 and vc5 paths, referring
to patch [1].

[1] drm/vc4: crtc: Assign output to channel automatically
    https://cgit.freedesktop.org/drm/drm-misc/commit/?id=87ebcd42fb7b8d1d3269007a621e41ae96a0077e

Change-Id: I5427cfd885c4643f764f8f84bed0f017c5d2a562
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
3 years agodrm/vc4: kms: Document the muxing corner cases 39/245639/1
Maxime Ripard [Thu, 8 Oct 2020 11:25:17 +0000 (13:25 +0200)]
drm/vc4: kms: Document the muxing corner cases

We've had a number of muxing corner-cases with specific ways to reproduce
them, so let's document them to make sure they aren't lost and introduce
regressions later on.

Signed-off-by: Maxime Ripard <maxime@cerno.tech>
[hoegeun.kwon: Needed to fix the page flip issue with dual HDMI.]
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
Change-Id: Icc4f82cd0c22f7b0b7e3c33be37d2e9d287638e4

3 years agodrm/vc4: kms: Split the HVS muxing check in a separate function 38/245638/1
Maxime Ripard [Mon, 12 Oct 2020 07:55:43 +0000 (16:55 +0900)]
drm/vc4: kms: Split the HVS muxing check in a separate function

The code that assigns HVS channels during atomic_check is starting to grow
a bit big, let's move it into a separate function.

Signed-off-by: Maxime Ripard <maxime@cerno.tech>
[hoegeun.kwon: Needed to fix the page flip issue with dual HDMI.]
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
Change-Id: Icf636a846282981d21ad3e6652a3ec51f14993a4

3 years agodrm/vc4: crtc: Keep the previously assigned HVS FIFO 37/245637/1
Maxime Ripard [Mon, 12 Oct 2020 07:44:43 +0000 (16:44 +0900)]
drm/vc4: crtc: Keep the previously assigned HVS FIFO

The HVS FIFOs are currently assigned each time we have an atomic_check
for all the enabled CRTCs.

However, if we are running multiple outputs in parallel and we happen to
disable the first (by index) CRTC, we end up changing the assigned FIFO
of the second CRTC without disabling and reenabling the pixelvalve which
ends up in a stall and eventually a VBLANK timeout.

In order to fix this, we can create a special value for our assigned
channel to mark it as disabled, and if our CRTC already had an assigned
channel in its previous state, we keep on using it.
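
A minimal sketch of the idea (the VC4_HVS_CHANNEL_DISABLED marker and field
names are assumptions):

  /* If the CRTC already owned a FIFO in its previous state, keep it rather
   * than reassigning one and forcing a pixelvalve disable/enable cycle. */
  if (old_vc4_state->assigned_channel != VC4_HVS_CHANNEL_DISABLED) {
          new_vc4_state->assigned_channel = old_vc4_state->assigned_channel;
          continue;
  }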

Fixes: 87ebcd42fb7b ("drm/vc4: crtc: Assign output to channel automatically")
Signed-off-by: Maxime Ripard <maxime@cerno.tech>
[hoegeun.kwon: Needed to fix the page flip issue with dual HDMI.]
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
Change-Id: I70e0d6b563d5c7f22a1a7e1527a401ac42395b78

3 years agodrm/vc4: kms: Assign a FIFO to enabled CRTCs instead of active 36/245636/1
Maxime Ripard [Mon, 12 Oct 2020 04:26:40 +0000 (13:26 +0900)]
drm/vc4: kms: Assign a FIFO to enabled CRTCs instead of active

The HVS has three FIFOs that can be assigned to a number of PixelValves
through a mux.

However, changing that FIFO requires that we disable and then enable the
pixelvalve, so we want to assign FIFOs to all the enabled CRTCs, and not
just the active ones.

Fixes: 87ebcd42fb7b ("drm/vc4: crtc: Assign output to channel automatically")
Signed-off-by: Maxime Ripard <maxime@cerno.tech>
[hoegeun.kwon: Needed to fix the page flip issue with dual HDMI.]
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
Change-Id: Id418d321e3cfc9b992bd1868e8084d9f8694a46e

3 years agobrcmfmac: change from brcmf_dbg to brcmf_info 18/245218/1 accepted/tizen/6.0/unified/20201030.104454 accepted/tizen/6.0/unified/hotfix/20201102.235212 accepted/tizen/unified/20201006.080059 submit/tizen/20201006.033113 submit/tizen_6.0/20201029.205501 submit/tizen_6.0_hotfix/20201102.192901 submit/tizen_6.0_hotfix/20201103.115101 tizen_6.0.m2_release
Jaehoon Chung [Mon, 5 Oct 2020 11:39:01 +0000 (20:39 +0900)]
brcmfmac: change from brcmf_dbg to brcmf_info

Change from brcmf_dbg to brcmf_info.
This patch is a workaround to check debug messages on Testhub.

Change-Id: I7fef8f64fc38e43b6544fab3e9474e7f1829cc60
Signed-off-by: Jaehoon Chung <jh80.chung@samsung.com>
3 years agodrm/vc4: hvs: Pull the state of all the CRTCs prior to PV muxing 24/244524/2 accepted/tizen/unified/20200922.053406 submit/tizen/20200922.014449
Maxime Ripard [Thu, 17 Sep 2020 12:16:23 +0000 (14:16 +0200)]
drm/vc4: hvs: Pull the state of all the CRTCs prior to PV muxing

The vc4 display engine has a first controller called the HVS that will
perform the composition of the planes. That HVS has 3 FIFOs and can
therefore compose planes for up to three outputs. The timings part is
generated through a component called the Pixel Valve, and the BCM2711 has 6
of them.

Thus, the HVS has some bits to control which FIFO gets output to which
Pixel Valve. The current code supports that muxing by looking at all the
CRTCs in a new DRM atomic state in atomic_check, and given the set of
constraints that we have, assigns FIFOs to CRTCs or rejects the mode
entirely. The actual muxing will occur during atomic_commit.

However, that doesn't work if only a fraction of the CRTCs' state is
updated in that state, since it will ignore the CRTCs that are kept running
unmodified, and will thus unassign its associated FIFO, and later disable
it.

In order to make the code work as expected, let's pull the CRTC state of
all the enabled CRTC in our atomic_check so that we can operate on all the
running CRTCs, no matter whether they are affected by the new state or not.
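
A minimal sketch of pulling the CRTCs into the atomic state (standard DRM
helpers; the actual patch restricts this to the enabled CRTCs):

  drm_for_each_crtc(crtc, dev) {
          struct drm_crtc_state *crtc_state;

          crtc_state = drm_atomic_get_crtc_state(state, crtc);
          if (IS_ERR(crtc_state))
                  return PTR_ERR(crtc_state);
  }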

Fixes: 87ebcd42fb7b ("drm/vc4: crtc: Assign output to channel automatically")
Signed-off-by: Maxime Ripard <maxime@cerno.tech>
Tested-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
Tested-by: Dave Stevenson <dave.stevenson@raspberrypi.com>
Reviewed-by: Dave Stevenson <dave.stevenson@raspberrypi.com>
Reviewed-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200917121623.42023-1-maxime@cerno.tech
[hoegeun.kwon: Fix dual hdmi issue]
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
Change-Id: I750e253c7415b09e546beb2608445d581c441efe

3 years agodrm/vc4: crtc: Add null pointer error handling 02/244402/1
Hoegeun Kwon [Fri, 18 Sep 2020 06:54:43 +0000 (15:54 +0900)]
drm/vc4: crtc: Add null pointer error handling

A crash occurs when vc4_encoder is NULL. Add NULL pointer error handling
(a minimal sketch follows the trace below).

  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000090
  Mem abort info:
    ESR = 0x96000005
    EC = 0x25: DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
  Data abort info:
    ISV = 0, ISS = 0x00000005
    CM = 0, WnR = 0
  user pgtable: 4k pages, 39-bit VAs, pgdp=00000000f2a96000
  [0000000000000090] pgd=0000000000000000, pud=0000000000000000
  Internal error: Oops: 96000005 [#1] PREEMPT SMP
  Modules linked in: brcmfmac joydev brcmutil rpivid_mem
  CPU: 0 PID: 32 Comm: kworker/0:1 Not tainted 5.4.50-v8+ #414
  Hardware name: Raspberry Pi 4 Model B (DT)
  Workqueue: events output_poll_execute
  pstate: 80000005 (Nzcv daif -PAN -UAO)
  pc : vc4_crtc_atomic_disable+0x150/0x2e8
  lr : vc4_crtc_atomic_disable+0x50/0x2e8
  ...
  Call trace:
   vc4_crtc_atomic_disable+0x150/0x2e8
   drm_atomic_helper_commit_modeset_disables+0x344/0x410
   vc4_atomic_complete_commit+0xd4/0x530
   vc4_atomic_commit+0xf4/0x190
   drm_atomic_commit+0x54/0x60
   ...
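
A minimal sketch of the added guard (illustrative only):

  /* vc4_crtc_atomic_disable(): bail out rather than dereference a NULL
   * encoder when no encoder is attached to the CRTC. */
  if (!vc4_encoder)
          return;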

Change-Id: Ib3271bb3c8318079fe5815711f89a217e835adc6
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
3 years agoscripts: mkbootimg_rpi4.sh: use compress image 65/244165/1
Seung-Woo Kim [Tue, 15 Sep 2020 09:48:22 +0000 (18:48 +0900)]
scripts: mkbootimg_rpi4.sh: use compress image

Use a compressed image to reduce the image size.

Change-Id: I9ec8b7e5e35bfa8ae6ac2832d52d4748e9a79668
Signed-off-by: Seung-Woo Kim <sw0312.kim@samsung.com>
3 years agodrm/vc4: hvs: Boost the core clock during modeset 75/244075/3
Hoegeun Kwon [Tue, 15 Sep 2020 00:32:49 +0000 (09:32 +0900)]
drm/vc4: hvs: Boost the core clock during modeset

In order to prevent timeouts and stalls in the pipeline, the core clock
needs to be maxed at 500MHz during a modeset on the BCM2711.
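
A minimal sketch of the approach (variable names are assumptions):

  /* Pin the HVS core clock at 500 MHz while the commit runs, then drop
   * the floor again once the modeset is done. */
  clk_set_min_rate(hvs->core_clk, 500000000UL);
  /* ... perform the atomic commit / modeset ... */
  clk_set_min_rate(hvs->core_clk, 0);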

Signed-off-by: Maxime Ripard <maxime@cerno.tech>
Tested-by: Chanwoo Choi <cw00.choi@samsung.com>
Tested-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
Tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
Link: https://patchwork.freedesktop.org/patch/msgid/37ed9e0124c5cce005ddc8dafe821d8b0da036ff.1599120059.git-series.maxime@cerno.tech
[hoegeun.kwon: A screen cracking problem occurs at FHD (1920x1080). The
clock should be kept at 500MHz, but the problem occurred when the clock
dropped to 200MHz, so apply this patch.]
Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com>
Change-Id: Ie0d4880966c1644bfdf56bb49dbb82559978538c