mm/page_alloc: free pages in a single pass during bulk free
free_pcppages_bulk() has taken two passes through the pcp lists since
commit
0a5f4e5b4562 ("mm/free_pcppages_bulk: do not hold lock when
picking pages to free") due to deferring the cost of selecting PCP lists
until the zone lock is held. Now that list selection is simplier, the
main cost during selection is bulkfree_pcp_prepare() which in the normal
case is a simple check and prefetching. As the list manipulations have
cost in itself, go back to freeing pages in a single pass.
The series up to this point was evaulated using a trunc microbenchmark
that is truncating sparse files stored in page cache (mmtests config
config-io-trunc). Sparse files were used to limit filesystem
interaction. The results versus a revert of storing high-order pages in
the PCP lists is
1-socket Skylake
5.17.0-rc3 5.17.0-rc3 5.17.0-rc3
vanilla mm-reverthighpcp-v1 mm-highpcpopt-v2
Min elapsed 540.00 ( 0.00%) 530.00 ( 1.85%) 530.00 ( 1.85%)
Amean elapsed 543.00 ( 0.00%) 530.00 * 2.39%* 530.00 * 2.39%*
Stddev elapsed 4.83 ( 0.00%) 0.00 ( 100.00%) 0.00 ( 100.00%)
CoeffVar elapsed 0.89 ( 0.00%) 0.00 ( 100.00%) 0.00 ( 100.00%)
Max elapsed 550.00 ( 0.00%) 530.00 ( 3.64%) 530.00 ( 3.64%)
BAmean-50 elapsed 540.00 ( 0.00%) 530.00 ( 1.85%) 530.00 ( 1.85%)
BAmean-95 elapsed 542.22 ( 0.00%) 530.00 ( 2.25%) 530.00 ( 2.25%)
BAmean-99 elapsed 542.22 ( 0.00%) 530.00 ( 2.25%) 530.00 ( 2.25%)
2-socket CascadeLake
5.17.0-rc3 5.17.0-rc3 5.17.0-rc3
vanilla mm-reverthighpcp-v1 mm-highpcpopt-v2
Min elapsed 510.00 ( 0.00%) 500.00 ( 1.96%) 500.00 ( 1.96%)
Amean elapsed 529.00 ( 0.00%) 521.00 ( 1.51%) 510.00 * 3.59%*
Stddev elapsed 16.63 ( 0.00%) 12.87 ( 22.64%) 11.55 ( 30.58%)
CoeffVar elapsed 3.14 ( 0.00%) 2.47 ( 21.46%) 2.26 ( 27.99%)
Max elapsed 550.00 ( 0.00%) 540.00 ( 1.82%) 530.00 ( 3.64%)
BAmean-50 elapsed 516.00 ( 0.00%) 512.00 ( 0.78%) 500.00 ( 3.10%)
BAmean-95 elapsed 526.67 ( 0.00%) 518.89 ( 1.48%) 507.78 ( 3.59%)
BAmean-99 elapsed 526.67 ( 0.00%) 518.89 ( 1.48%) 507.78 ( 3.59%)
The original motivation for multi-passes was will-it-scale page_fault1
using $nr_cpu processes.
2-socket CascadeLake (40 cores, 80 CPUs HT enabled)
5.17.0-rc3 5.17.0-rc3
vanilla mm-highpcpopt-v2
Hmean page_fault1-processes-2
2694662.26 ( 0.00%)
2695780.35 ( 0.04%)
Hmean page_fault1-processes-5
6425819.34 ( 0.00%)
6435544.57 * 0.15%*
Hmean page_fault1-processes-8
9642169.10 ( 0.00%)
9658962.39 ( 0.17%)
Hmean page_fault1-processes-12
12167502.10 ( 0.00%)
12190163.79 ( 0.19%)
Hmean page_fault1-processes-21
15636859.03 ( 0.00%)
15612447.26 ( -0.16%)
Hmean page_fault1-processes-30
25157348.61 ( 0.00%)
25169456.65 ( 0.05%)
Hmean page_fault1-processes-48
27694013.85 ( 0.00%)
27671111.46 ( -0.08%)
Hmean page_fault1-processes-79
25928742.64 ( 0.00%)
25934202.02 ( 0.02%) <--
Hmean page_fault1-processes-110
25730869.75 ( 0.00%)
25671880.65 * -0.23%*
Hmean page_fault1-processes-141
25626992.42 ( 0.00%)
25629551.61 ( 0.01%)
Hmean page_fault1-processes-172
25611651.35 ( 0.00%)
25614927.99 ( 0.01%)
Hmean page_fault1-processes-203
25577298.75 ( 0.00%)
25583445.59 ( 0.02%)
Hmean page_fault1-processes-234
25580686.07 ( 0.00%)
25608240.71 ( 0.11%)
Hmean page_fault1-processes-265
25570215.47 ( 0.00%)
25568647.58 ( -0.01%)
Hmean page_fault1-processes-296
25549488.62 ( 0.00%)
25543935.00 ( -0.02%)
Hmean page_fault1-processes-320
25555149.05 ( 0.00%)
25575696.74 ( 0.08%)
The differences are mostly within the noise and the difference close to
$nr_cpus is negligible.
Link: https://lkml.kernel.org/r/20220217002227.5739-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Aaron Lu <aaron.lu@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>