.. SPDX-License-Identifier: GPL-2.0

=============
Multi-Gen LRU
=============
The multi-gen LRU is an alternative LRU implementation that optimizes
page reclaim and improves performance under memory pressure. Page
reclaim decides the kernel's caching policy and ability to overcommit
memory. It directly impacts the kswapd CPU usage and RAM efficiency.

Quick start
===========
Build the kernel with the following configuration options:

* ``CONFIG_LRU_GEN=y``
* ``CONFIG_LRU_GEN_ENABLED=y``

All set!
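
One way to confirm that a kernel was built with these options is to
search its build configuration. The sketch below assumes that the
configuration is available as a plain-text file, e.g.,
``/boot/config-$(uname -r)`` on many distributions (that path is an
assumption, not something this document guarantees).
``lru_gen_configured`` is a hypothetical helper name.
::

    # lru_gen_configured CONFIG_FILE
    # Succeed if CONFIG_FILE sets both options listed above to "y".
    lru_gen_configured() {
        grep -q '^CONFIG_LRU_GEN=y$' "$1" &&
        grep -q '^CONFIG_LRU_GEN_ENABLED=y$' "$1"
    }

    # Usage (the path is an assumption; adjust it for your system):
    #   lru_gen_configured "/boot/config-$(uname -r)" && echo "built in"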

Runtime options
===============
``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
following subsections.

Kill switch
-----------
``enabled`` accepts different values to enable or disable the
following components. Its default value depends on
``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
unless some of them have unforeseen side effects. Writing to
``enabled`` has no effect when a component is not supported by the
hardware, and valid values will be accepted even when the main switch
is off.

====== ===============================================================
Values Components
====== ===============================================================
0x0001 The main switch for the multi-gen LRU.
0x0002 Clearing the accessed bit in leaf page table entries in large
       batches, when MMU sets it (e.g., on x86). This behavior can
       theoretically worsen lock contention (mmap_lock). If it is
       disabled, the multi-gen LRU will suffer a minor performance
       degradation for workloads that contiguously map hot pages,
       whose accessed bits can be otherwise cleared by fewer larger
       batches.
0x0004 Clearing the accessed bit in non-leaf page table entries as
       well, when MMU sets it (e.g., on x86). This behavior was not
       verified on x86 varieties other than Intel and AMD. If it is
       disabled, the multi-gen LRU will suffer a negligible
       performance degradation.
[yYnN] Apply to all the components above.
====== ===============================================================

E.g.,
::

    echo y >/sys/kernel/mm/lru_gen/enabled
    cat /sys/kernel/mm/lru_gen/enabled
    0x0007
    echo 5 >/sys/kernel/mm/lru_gen/enabled
    cat /sys/kernel/mm/lru_gen/enabled
    0x0005
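
The value read back from ``enabled`` is a bitmask of the components
in the table above. As an illustration, the following shell sketch
splits such a value into its components. ``lru_gen_decode_enabled``
is a hypothetical helper name, not a tool shipped with the kernel,
and the short component labels are paraphrased from the table.
::

    # lru_gen_decode_enabled VALUE
    # Print one line per component bit set in VALUE (e.g., 0x0005).
    lru_gen_decode_enabled() {
        mask=$(($1))
        if [ $((mask & 0x0001)) -ne 0 ]; then
            echo "0x0001 main switch"
        fi
        if [ $((mask & 0x0002)) -ne 0 ]; then
            echo "0x0002 leaf accessed bit clearing"
        fi
        if [ $((mask & 0x0004)) -ne 0 ]; then
            echo "0x0004 non-leaf accessed bit clearing"
        fi
    }

    # Usage, on a kernel built with CONFIG_LRU_GEN=y:
    #   lru_gen_decode_enabled "$(cat /sys/kernel/mm/lru_gen/enabled)"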

Thrashing prevention
--------------------
Personal computers are more sensitive to thrashing because it can
cause janks (lags when rendering UI) and negatively impact user
experience. The multi-gen LRU offers thrashing prevention to the
majority of laptop and desktop users who do not have ``oomd``.

Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
the last ``N`` milliseconds from getting evicted. The OOM killer is
triggered if this working set cannot be kept in memory. In other
words, this option works as an adjustable pressure relief valve, and
when open, it terminates applications that are hopefully not being
used.

Based on the average human-detectable lag (~100 ms), ``N=1000``
usually eliminates intolerable janks due to thrashing. Larger values
like ``N=3000`` make janks less noticeable at the risk of premature
OOM kills.

The default value ``0`` disables this feature.
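
For example, the one-second setting suggested above could be applied
with a small wrapper like the following sketch.
``lru_gen_set_min_ttl_ms`` is a hypothetical helper; its optional
second argument exists only so that the function can be tried against
an ordinary file instead of the sysfs attribute.
::

    # lru_gen_set_min_ttl_ms N [FILE]
    # Write N (milliseconds, 0 to disable) to min_ttl_ms. Writing to
    # the default sysfs path needs root.
    lru_gen_set_min_ttl_ms() {
        case $1 in
        ''|*[!0-9]*)
            echo "usage: lru_gen_set_min_ttl_ms N [FILE]" >&2
            return 1
            ;;
        esac
        echo "$1" >"${2:-/sys/kernel/mm/lru_gen/min_ttl_ms}"
    }

    # Usage: protect the working set of the last second.
    #   lru_gen_set_min_ttl_ms 1000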

Experimental features
=====================
``/sys/kernel/debug/lru_gen`` accepts commands described in the
following subsections. Multiple command lines are supported, as is
concatenation with the delimiters ``,`` and ``;``.

``/sys/kernel/debug/lru_gen_full`` provides additional stats for
debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
evicted generations in this file.

Working set estimation
----------------------
Working set estimation measures how much memory an application needs
in a given time interval, and it is usually done with little impact on
the performance of the application. E.g., data centers want to
optimize job scheduling (bin packing) to improve memory utilization.
When a new job comes in, the job scheduler needs to find out whether
each server it manages can allocate a certain amount of memory for
this new job before it can pick a candidate. To do so, the job
scheduler needs to estimate the working sets of the existing jobs.

When it is read, ``lru_gen`` returns a histogram of numbers of pages
accessed over different time intervals for each memcg and node.
``MAX_NR_GENS`` decides the number of bins for each histogram. The
histograms are noncumulative.
::

    memcg  memcg_id  memcg_path
       node  node_id
           min_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
           ...
           max_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages

Each bin contains an estimated number of pages that have been accessed
within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
the former is the largest and that of the latter is the smallest.
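
Because the histograms are noncumulative, a consumer can add up bins
directly. The following sketch totals, for each memcg and node, the
pages in generations whose ``age_in_ms`` is at least a given
threshold, as a rough proxy for cold memory. It assumes that every
line of the output follows the layout shown above, with the literal
keywords ``memcg`` and ``node`` in the first field.
``lru_gen_cold_pages`` is a hypothetical helper name.
::

    # lru_gen_cold_pages MIN_AGE_MS [FILE]
    # Print "memcg ID node ID cold_pages N" for each memcg and node.
    lru_gen_cold_pages() {
        awk -v min_age="$1" '
            function report() {
                if (have)
                    printf "memcg %s node %s cold_pages %d\n",
                           memcg, node, cold
            }
            $1 == "memcg" { report(); have = 0; memcg = $2; next }
            $1 == "node"  { report(); have = 1; node = $2; cold = 0; next }
            # Generation line: gen_nr age_in_ms nr_anon_pages nr_file_pages
            have && NF >= 4 && $2 + 0 >= min_age + 0 { cold += $3 + $4 }
            END { report() }
        ' "${2:-/sys/kernel/debug/lru_gen}"
    }

    # Usage (needs root and a mounted debugfs): pages in generations
    # that are at least 10 seconds old.
    #   lru_gen_cold_pages 10000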

Users can write the following command to ``lru_gen`` to create a new
generation ``max_gen_nr+1``:

    ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]``

``can_swap`` defaults to the swap setting and, if it is set to ``1``,
it forces the scan of anon pages when swap is off, and vice versa.
``force_scan`` defaults to ``1`` and, if it is set to ``0``, it
employs heuristics to reduce the overhead, which is likely to reduce
the coverage as well.
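
As an illustration, the sketch below looks up the current
``max_gen_nr`` for one memcg and node and formats the aging command
for it. Both function names are hypothetical, and the lookup assumes
that the output of ``lru_gen`` follows the layout shown earlier, with
the literal keywords ``memcg`` and ``node`` in the first field.
::

    # lru_gen_max_gen MEMCG_ID NODE_ID [FILE]
    # Print max_gen_nr, i.e., the last generation listed for the node.
    lru_gen_max_gen() {
        awk -v memcg="$1" -v node="$2" '
            $1 == "memcg" { in_memcg = ($2 == memcg); in_node = 0; next }
            $1 == "node"  { in_node = (in_memcg && $2 == node); next }
            in_node && NF >= 4 { gen = $1 }
            END { if (gen == "") exit 1; print gen }
        ' "${3:-/sys/kernel/debug/lru_gen}"
    }

    # lru_gen_age_cmd MEMCG_ID NODE_ID MAX_GEN_NR [CAN_SWAP [FORCE_SCAN]]
    # Print the aging command; redirect it to lru_gen to run it.
    lru_gen_age_cmd() {
        if [ $# -lt 3 ] || [ $# -gt 5 ]; then
            echo "usage: lru_gen_age_cmd MEMCG_ID NODE_ID MAX_GEN_NR" \
                 "[CAN_SWAP [FORCE_SCAN]]" >&2
            return 1
        fi
        echo "+ $*"
    }

    # Usage (needs root and a mounted debugfs; memcg 1 and node 0 are
    # placeholder IDs):
    #   gen=$(lru_gen_max_gen 1 0) &&
    #       lru_gen_age_cmd 1 0 "$gen" >/sys/kernel/debug/lru_gen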

A typical use case is that a job scheduler runs this command at a
certain time interval to create new generations, and it ranks the
servers it manages based on the sizes of their cold pages defined by
this time interval.

Proactive reclaim
-----------------
Proactive reclaim induces page reclaim when there is no memory
pressure. It usually targets cold pages only. E.g., when a new job
comes in, the job scheduler wants to proactively reclaim cold pages on
the server it selected, to improve the chance of successfully landing
this new job.

Users can write the following command to ``lru_gen`` to evict
generations less than or equal to ``min_gen_nr``.

    ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]``

``min_gen_nr`` should be less than ``max_gen_nr-1``, since
``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to
the active list) and therefore cannot be evicted. ``swappiness``
overrides the default value in ``/proc/sys/vm/swappiness``.
``nr_to_reclaim`` limits the number of pages to evict.
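
The constraint on ``min_gen_nr`` can be checked before a command is
written. The following sketch formats the eviction command and
rejects a ``min_gen_nr`` that is not less than ``max_gen_nr-1``.
``lru_gen_evict_cmd`` is a hypothetical helper; its first argument is
used only for the check and is not part of the command.
::

    # lru_gen_evict_cmd MAX_GEN_NR MEMCG_ID NODE_ID MIN_GEN_NR \
    #                   [SWAPPINESS [NR_TO_RECLAIM]]
    # Print the eviction command; redirect it to lru_gen to run it.
    lru_gen_evict_cmd() {
        if [ $# -lt 4 ] || [ $# -gt 6 ]; then
            echo "usage: lru_gen_evict_cmd MAX_GEN_NR MEMCG_ID NODE_ID" \
                 "MIN_GEN_NR [SWAPPINESS [NR_TO_RECLAIM]]" >&2
            return 1
        fi
        max_gen=$1
        shift
        # After the shift, $3 is MIN_GEN_NR.
        if [ "$3" -ge $((max_gen - 1)) ]; then
            echo "min_gen_nr must be less than max_gen_nr-1" >&2
            return 1
        fi
        echo "- $*"
    }

    # Usage (needs root and a mounted debugfs; all numbers are
    # placeholders): evict up to 1000 pages from generations <= 3 of
    # memcg 1 on node 0, with swappiness 20.
    #   lru_gen_evict_cmd 7 1 0 3 20 1000 >/sys/kernel/debug/lru_gen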

A typical use case is that a job scheduler runs this command before it
tries to land a new job on a server. If it fails to materialize enough
cold pages because of overestimation, it retries on the next server
according to the ranking result obtained from the working set
estimation step. This less forceful approach limits the impact on the
existing jobs.