Patch series "mm: memcontrol: switch to rstat", v3.
This series converts memcg stats tracking to the streamlined rstat
infrastructure provided by the cgroup core code. rstat is already used by
the CPU controller and the IO controller. This change is motivated by
recent accuracy problems in memcg's custom stats code, as well as the
benefits of sharing common infra with other controllers.
The current memcg implementation does batched tree aggregation on the
write side: local stat changes are cached in per-cpu counters, which are
then propagated upward in batches when a threshold (32 pages) is exceeded.
This is cheap, but the error introduced by the lazy upward propagation
adds up: 32 pages times CPUs times cgroups in the subtree. We've had
complaints from service owners that the stats do not reliably track and
react to allocation behavior as expected, sometimes swallowing the results
of entire test applications.
The original memcg stat implementation used to do tree aggregation
exclusively on the read side: local stats would only ever be tracked in
per-cpu counters, and a memory.stat read would iterate the entire subtree
and sum those counters up. This didn't keep up with the times:
- Cgroup trees are much bigger now. We switched to lazily-freed
cgroups, where deleted groups would hang around until their remaining
page cache has been reclaimed. This can result in large subtrees that
are expensive to walk, while most of the groups are idle and their
statistics don't change much anymore.
- Automated monitoring increased. With the proliferation of userspace
oom killing, proactive reclaim, and higher-resolution logging of
workload trends in general, top-level stat files are polled at least
once a second in many deployments.
- The lifetime of cgroups got shorter. Where most cgroup setups in the
past would have a few large policy-oriented cgroups for everything
running on the system, newer cgroup deployments tend to create one
group per application - which gets deleted again as the processes
exit. An aggregation scheme that doesn't retain child data inside the
parents loses event history of the subtree.
Rstat addresses all three of those concerns through intelligent,
persistent read-side aggregation. As statistics change at the local
level, rstat tracks - on a per-cpu basis - only those parts of a subtree
that have changes pending and require aggregation. The actual
aggregation occurs on the colder read side - which can now skip over
(potentially large) numbers of recently idle cgroups.
===
The test_kmem cgroup selftest is currently failing due to excessive
cumulative vmstat drift from 100 subgroups:
ok 1 test_kmem_basic
memory.current =
8810496
slab + anon + file + kernel_stack =
17074568
slab =
6101384
anon = 946176
file = 0
kernel_stack =
10027008
not ok 2 test_kmem_memcg_deletion
ok 3 test_kmem_proc_kpagecgroup
ok 4 test_kmem_kernel_stacks
ok 5 test_kmem_dead_cgroups
ok 6 test_percpu_basic
As you can see, memory.stat items far exceed memory.current. The kernel
stack alone is bigger than all of charged memory. That's because the
memory of the test has been uncharged from memory.current, but the
negative vmstat deltas are still sitting in the percpu caches.
The test at this time isn't even counting percpu, pagetables etc. yet,
which would further contribute to the error. The last patch in the series
updates the test to include them - as well as reduces the vmstat
tolerances in general to only expect page_counter batching.
With all patches applied, the (now more stringent) test succeeds:
ok 1 test_kmem_basic
ok 2 test_kmem_memcg_deletion
ok 3 test_kmem_proc_kpagecgroup
ok 4 test_kmem_kernel_stacks
ok 5 test_kmem_dead_cgroups
ok 6 test_percpu_basic
===
A kernel build test confirms that overhead is comparable. Two kernels are
built simultaneously in a nested tree with several idle siblings:
root - kernelbuild - one - two - three - four - build-a (defconfig, make -j16)
`- build-b (defconfig, make -j16)
`- idle-1
`- ...
`- idle-9
During the builds, kernelbuild/memory.stat is read once a second.
A perf diff shows that the changes in cycle distribution is
minimal. Top 10 kernel symbols:
0.09% +0.08% [kernel.kallsyms] [k] __mod_memcg_lruvec_state
0.00% +0.06% [kernel.kallsyms] [k] cgroup_rstat_updated
0.08% -0.05% [kernel.kallsyms] [k] __mod_memcg_state.part.0
0.16% -0.04% [kernel.kallsyms] [k] release_pages
0.00% +0.03% [kernel.kallsyms] [k] __count_memcg_events
0.01% +0.03% [kernel.kallsyms] [k] mem_cgroup_charge_statistics.constprop.0
0.10% -0.02% [kernel.kallsyms] [k] get_mem_cgroup_from_mm
0.05% -0.02% [kernel.kallsyms] [k] mem_cgroup_update_lru_size
0.57% +0.01% [kernel.kallsyms] [k] asm_exc_page_fault
===
The on-demand aggregated stats are now fully accurate:
$ grep -e nr_inactive_file /proc/vmstat | awk '{print($1,$2*4096)}'; \
grep -e inactive_file /sys/fs/cgroup/memory.stat
vanilla: patched:
nr_inactive_file
1574105088 nr_inactive_file
1027801088
inactive_file
1577410560 inactive_file
1027801088
===
This patch (of 8):
The memcg hotunplug callback erroneously flushes counts on the local CPU,
not the counts of the CPU going away; those counts will be lost.
Flush the CPU that is actually going away.
Also simplify the code a bit by using mod_memcg_state() and
count_memcg_events() instead of open-coding the upward flush - this is
comparable to how vmstat.c handles hotunplug flushing.
Link: https://lkml.kernel.org/r/20210209163304.77088-1-hannes@cmpxchg.org
Link: https://lkml.kernel.org/r/20210209163304.77088-2-hannes@cmpxchg.org
Fixes: a983b5ebee572 ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>