+.. SPDX-License-Identifier: GPL-2.0
+
+=======
+The TLB
+=======
+
When the kernel unmaps or modifies the attributes of a range of
memory, it has two choices:
+
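1. Flush the entire TLB with a two-instruction sequence. This is
   a quick operation, but it causes collateral damage: TLB entries
   from areas other than the one we are trying to flush will be
   destroyed and must be refilled later, at some cost.
2. Use the invlpg instruction to invalidate a single page at a
   time. This could potentially cost many more instructions, but
   it is a much more precise operation, causing no collateral
   damage to other TLB entries.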
Which method to use depends on a few things:
+
1. The size of the flush being performed. A flush of the entire
address space is obviously better performed by flushing the
entire TLB than doing 2^48/PAGE_SIZE individual flushes.
You may be doing too many individual invalidations if you see the
invlpg instruction (or instructions _near_ it) show up high in
profiles. If you believe that individual invalidations are being
-called too often, you can lower the tunable:
+called too often, you can lower the tunable::
/sys/kernel/debug/x86/tlb_single_page_flush_ceiling
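
This will cause us to do the global flush for more cases.
Lowering it to 0 will disable the use of the individual flushes.
Setting it to 1 is a very conservative setting and it should
never need to be 0 under normal circumstances.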
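
The effect of the tunable can be pictured as a single comparison. What
follows is a minimal userspace sketch, not the kernel's implementation:
the helper name ``want_full_flush`` and the fixed ``PAGE_SHIFT`` of 12
are assumptions made for illustration, as is the idea that the ceiling
is compared against the number of pages in the range being flushed.

```c
#include <stdbool.h>

#define PAGE_SHIFT 12UL	/* assumption: 4 KB base pages */

/*
 * Hypothetical helper: returns true when a flush of [start, end) covers
 * more pages than the ceiling allows, in which case a full TLB flush
 * would be chosen over per-page invlpg.
 */
static bool want_full_flush(unsigned long start, unsigned long end,
			    unsigned long ceiling)
{
	unsigned long nr_pages = (end - start) >> PAGE_SHIFT;

	return nr_pages > ceiling;
}
```

Under these assumptions, a ceiling of 0 sends even a one-page flush to
the full flush, while a ceiling of 1 still lets a one-page flush use a
single invlpg and pushes anything larger to the full flush.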
Despite the fact that a single individual flush on x86 is
-guaranteed to flush a full 2MB [1], hugetlbfs always uses the full
+guaranteed to flush a full 2MB [1]_, hugetlbfs always uses the full
flushes. THP is treated exactly the same as normal memory.
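You might see invlpg inside of flush_tlb_mm_range() show up in
profiles, or you can use the trace_tlb_flush() tracepoints to
determine how long the flush operations are taking.

Essentially, you are balancing the cycles you spend doing invlpg
with the cycles that you spend refilling the TLB later.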
You can measure how expensive TLB refills are by using
-performance counters and 'perf stat', like this:
+performance counters and 'perf stat', like this::
-perf stat -e
- cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/,
- cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/,
- cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/,
- cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/,
- cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/,
- cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/
+ perf stat -e
+ cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/,
+ cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/,
+ cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/,
+ cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/,
+ cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/,
+ cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/
That works on an IvyBridge-era CPU (i5-3320M). Different CPUs
may have differently-named counters, but they should at least
be there in some form. You can use pmu-tools 'ocperf list'
(https://github.com/andikleen/pmu-tools) to find the right
counters for a given CPU.
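
One way to read the counter pairs above is to divide each
``walk_duration`` count by the matching ``walk_completed`` count, which
gives a rough average cost per page walk. The sketch below assumes the
duration counters count cycles; the helper name ``avg_walk_cycles`` is
hypothetical and not part of perf or the kernel.

```c
#include <stdint.h>

/*
 * Hypothetical helper: average cycles per completed page walk, computed
 * from one walk_duration / walk_completed counter pair as reported by
 * 'perf stat'. Returns 0.0 when no walks completed, to avoid dividing
 * by zero.
 */
static double avg_walk_cycles(uint64_t walk_duration, uint64_t walk_completed)
{
	if (walk_completed == 0)
		return 0.0;
	return (double)walk_duration / (double)walk_completed;
}
```

Computing this figure separately for the load, store, and instruction
counter pairs may help show which kind of TLB refill costs the most
for a given workload.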
-1. A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation"
+.. [1] A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation"
says: "One execution of INVLPG is sufficient even for a page
with size greater than 4 KBytes."