.. SPDX-License-Identifier: GPL-2.0

.. _physical_memory_model:

Physical memory in a system may be addressed in different ways. The
simplest case is when the physical memory starts at address 0 and
spans a contiguous range up to the maximal address. It could be,
however, that this range contains small holes that are not accessible
to the CPU. There could also be several contiguous ranges at
completely distinct addresses. And then there is NUMA, where
different memory banks are attached to different CPUs.

Linux abstracts this diversity using one of two memory models:
FLATMEM and SPARSEMEM. Each architecture defines which
memory models it supports, what the default memory model is, and
whether that default can be manually overridden.

All the memory models track the status of physical page frames using
`struct page` objects arranged in one or more arrays.

Regardless of the selected memory model, there is a one-to-one
mapping between the physical page frame number (PFN) and the
corresponding `struct page`.

Each memory model defines :c:func:`pfn_to_page` and :c:func:`page_to_pfn`
helpers that allow the conversion from PFN to `struct page` and vice
versa.

The simplest memory model is FLATMEM. This model is suitable for
non-NUMA systems with contiguous, or mostly contiguous, physical
memory.

In the FLATMEM memory model, there is a global `mem_map` array that
maps the entire physical memory. For most architectures, the holes
have entries in the `mem_map` array. The `struct page` objects
corresponding to the holes are never fully initialized.

To allocate the `mem_map` array, architecture-specific setup code should
call the :c:func:`free_area_init` function. However, the array is not
usable until the call to :c:func:`memblock_free_all`, which hands all the
memory to the page allocator.

An architecture may free parts of the `mem_map` array that do not cover
actual physical pages. In that case, the architecture-specific
:c:func:`pfn_valid` implementation should take the holes in the
`mem_map` into account.

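To illustrate what "taking the holes into account" can mean, here is a
minimal user-space sketch. It is not the kernel's :c:func:`pfn_valid`;
the `pfn_range` type, the `ram_ranges` table, its values and
`demo_pfn_valid()` are hypothetical names invented for this example. The
idea is only that a PFN is reported valid when it falls inside a range that
is still backed by the memory map.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of a PFN range [start, end) backed by mem_map. */
struct pfn_range {
	unsigned long start;
	unsigned long end;
};

/* Assumed layout: two populated banks with a hole between them. */
static const struct pfn_range ram_ranges[] = {
	{ 0x80000UL, 0x80400UL },
	{ 0x80800UL, 0x80c00UL },
};

/* Return true only for PFNs that fall inside one of the populated ranges. */
static bool demo_pfn_valid(unsigned long pfn)
{
	size_t i;

	for (i = 0; i < sizeof(ram_ranges) / sizeof(ram_ranges[0]); i++) {
		if (pfn >= ram_ranges[i].start && pfn < ram_ranges[i].end)
			return true;
	}
	return false;
}
```

With this table, PFNs in the hole between the two banks are rejected, so
callers never dereference a `struct page` that was freed.
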
With FLATMEM, the conversion between a PFN and the `struct page` is
straightforward: `PFN - ARCH_PFN_OFFSET` is an index into the
`mem_map` array.

`ARCH_PFN_OFFSET` defines the first page frame number for
systems whose physical memory starts at an address other than 0.

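The FLATMEM conversion can be sketched in plain C as follows. This is a
user-space illustration under stated assumptions, not the kernel's code:
`struct page` is reduced to a single field, the `ARCH_PFN_OFFSET` value
and the array size are made up, and `flat_pfn_to_page()` and
`flat_page_to_pfn()` are hypothetical names.

```c
/* Simplified stand-in for the kernel's struct page. */
struct page {
	unsigned long flags;
};

/* Assumed for the example: RAM starts at PFN 0x80000 and has 1024 frames. */
#define ARCH_PFN_OFFSET	0x80000UL
#define NR_PAGES	1024UL

/* One global array covering the whole physical range. */
static struct page mem_map[NR_PAGES];

/* PFN -> struct page: subtract the offset and index into mem_map. */
static struct page *flat_pfn_to_page(unsigned long pfn)
{
	return mem_map + (pfn - ARCH_PFN_OFFSET);
}

/* struct page -> PFN: the array index plus the offset. */
static unsigned long flat_page_to_pfn(const struct page *page)
{
	return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
}
```

Both directions are a single addition or subtraction, which is why FLATMEM
is the cheapest model when the physical address space has few holes.
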
SPARSEMEM is the most versatile memory model available in Linux and it
is the only memory model that supports several advanced features such
as hot-plug and hot-remove of physical memory, alternative memory
maps for non-volatile memory devices, and deferred initialization of
the memory map for larger systems.

The SPARSEMEM model presents the physical memory as a collection of
sections. A section is represented by `struct mem_section`, which
contains `section_mem_map`. Logically, this is a pointer to an
array of `struct page` objects, although it is stored together with
other encoded information that aids section management. The section size
and the maximal number of sections are specified using the
`SECTION_SIZE_BITS` and `MAX_PHYSMEM_BITS` constants defined by each
architecture that supports SPARSEMEM. While `MAX_PHYSMEM_BITS` is the
actual width of a physical address that an architecture supports,
`SECTION_SIZE_BITS` is an arbitrary value.

The maximal number of sections is denoted `NR_MEM_SECTIONS` and
defined as

.. math::

   NR\_MEM\_SECTIONS = 2 ^ {(MAX\_PHYSMEM\_BITS - SECTION\_SIZE\_BITS)}

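As a worked example of the formula above, the following sketch plugs in
illustrative values. The numbers (46 physical address bits, 128 MiB
sections, 4 KiB pages) are assumptions chosen for the arithmetic; each
architecture picks its own. The helper macros `SECTIONS_SHIFT`,
`PFN_SECTION_SHIFT` and `PAGES_PER_SECTION` are defined here for the
example.

```c
/* Illustrative values only; every architecture defines its own. */
#define MAX_PHYSMEM_BITS	46
#define SECTION_SIZE_BITS	27	/* 2^27 bytes = 128 MiB per section */
#define PAGE_SHIFT		12	/* 4 KiB pages, assumed */

/* Number of bits needed to number all sections. */
#define SECTIONS_SHIFT		(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
#define NR_MEM_SECTIONS		(1UL << SECTIONS_SHIFT)

/* Number of page frames covered by one section. */
#define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
#define PAGES_PER_SECTION	(1UL << PFN_SECTION_SHIFT)

/* Section number that a given PFN belongs to. */
static unsigned long demo_pfn_to_section_nr(unsigned long pfn)
{
	return pfn >> PFN_SECTION_SHIFT;
}
```

With these values there are 2^19 = 524288 possible sections, each covering
2^15 = 32768 page frames.
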
The `mem_section` objects are arranged in a two-dimensional array
called `mem_sections`. The size and placement of this array depend
on `CONFIG_SPARSEMEM_EXTREME` and the maximal possible number of
sections:

* When `CONFIG_SPARSEMEM_EXTREME` is disabled, the `mem_sections`
  array is static and has `NR_MEM_SECTIONS` rows. Each row holds a
  single `mem_section` object.
* When `CONFIG_SPARSEMEM_EXTREME` is enabled, the `mem_sections`
  array is dynamically allocated. Each row contains `PAGE_SIZE` worth of
  `mem_section` objects and the number of rows is calculated to fit
  all the memory sections.

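The dynamically allocated layout can be sketched in user space as below.
This is an illustration under assumptions, not the kernel implementation:
`struct mem_section` is reduced to one field, `PAGE_SIZE` and
`NR_MEM_SECTIONS` use made-up values, rows are allocated lazily with
`calloc()`, and `section_lookup()` is a hypothetical helper.

```c
#include <stdlib.h>

/* Simplified stand-in for the kernel's struct mem_section. */
struct mem_section {
	unsigned long section_mem_map;
};

#define PAGE_SIZE		4096UL		/* assumed */
#define NR_MEM_SECTIONS		(1UL << 19)	/* assumed */

/* Each row holds PAGE_SIZE worth of mem_section objects. */
#define SECTIONS_PER_ROOT	(PAGE_SIZE / sizeof(struct mem_section))
/* Enough rows to fit every possible section, rounded up. */
#define NR_SECTION_ROOTS \
	((NR_MEM_SECTIONS + SECTIONS_PER_ROOT - 1) / SECTIONS_PER_ROOT)

/* Table of row pointers; the table and its rows are allocated on demand. */
static struct mem_section **mem_sections;

/*
 * Return the mem_section for section number nr. When allocate is zero,
 * return NULL if the row that would hold nr does not exist yet.
 */
static struct mem_section *section_lookup(unsigned long nr, int allocate)
{
	unsigned long root = nr / SECTIONS_PER_ROOT;

	if (nr >= NR_MEM_SECTIONS)
		return NULL;
	if (!mem_sections) {
		if (!allocate)
			return NULL;
		mem_sections = calloc(NR_SECTION_ROOTS, sizeof(*mem_sections));
		if (!mem_sections)
			return NULL;
	}
	if (!mem_sections[root]) {
		if (!allocate)
			return NULL;
		mem_sections[root] = calloc(SECTIONS_PER_ROOT,
					    sizeof(struct mem_section));
		if (!mem_sections[root])
			return NULL;
	}
	return &mem_sections[root][nr % SECTIONS_PER_ROOT];
}
```

The two-level layout means that rows for address ranges without memory
never need to be allocated, which matters when `NR_MEM_SECTIONS` is large.
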
The architecture setup code should call sparse_init() to
initialize the memory sections and the memory maps.

With SPARSEMEM there are two possible ways to convert a PFN to the
corresponding `struct page`: "classic sparse" and "sparse
vmemmap". The selection is made at build time and is determined by
the value of `CONFIG_SPARSEMEM_VMEMMAP`.

Classic sparse encodes the section number of a page in page->flags
and uses the high bits of a PFN to access the section that maps that page
frame. Inside a section, the PFN is the index into the array of pages.

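A minimal user-space sketch of the classic sparse lookup follows. It makes
several simplifying assumptions that differ from the kernel: sections are
tiny (16 pages), the section number occupies the top three bits of
`flags`, each section stores a plain pointer to its page array with no
extra encoded bits, and all names ending in `_demo` or starting with
`sparse_` are hypothetical.

```c
/* Simplified stand-in for the kernel's struct page. */
struct page {
	unsigned long flags;
};

/* Assumed demo geometry: 8 sections of 16 pages each. */
#define PFN_SECTION_SHIFT	4
#define PAGES_PER_SECTION	(1UL << PFN_SECTION_SHIFT)
#define NR_SECTIONS		8UL
/* The top 3 bits of flags hold the section number in this sketch. */
#define SECTION_FLAGS_SHIFT	(sizeof(unsigned long) * 8 - 3)

/* Simplified section: a plain pointer to that section's page array. */
struct mem_section {
	struct page *map;
};

static struct page section_maps[NR_SECTIONS][PAGES_PER_SECTION];
static struct mem_section sections[NR_SECTIONS];

/* Point each section at its page array and stamp the section number. */
static void sparse_demo_init(void)
{
	unsigned long nr, i;

	for (nr = 0; nr < NR_SECTIONS; nr++) {
		sections[nr].map = section_maps[nr];
		for (i = 0; i < PAGES_PER_SECTION; i++)
			section_maps[nr][i].flags = nr << SECTION_FLAGS_SHIFT;
	}
}

/* High bits of the PFN select the section, low bits index into it. */
static struct page *sparse_pfn_to_page(unsigned long pfn)
{
	unsigned long nr = pfn >> PFN_SECTION_SHIFT;

	return sections[nr].map + (pfn & (PAGES_PER_SECTION - 1));
}

/* The section number stored in flags locates the array the page is in. */
static unsigned long sparse_page_to_pfn(const struct page *page)
{
	unsigned long nr = page->flags >> SECTION_FLAGS_SHIFT;

	return (nr << PFN_SECTION_SHIFT) +
	       (unsigned long)(page - sections[nr].map);
}
```

After `sparse_demo_init()`, PFN 35 resolves to index 3 of section 2, and
converting that page back yields 35 again.
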
Sparse vmemmap uses a virtually mapped memory map to optimize
pfn_to_page and page_to_pfn operations. There is a global `struct
page *vmemmap` pointer that points to a virtually contiguous array of
`struct page` objects. A PFN is an index into that array and the
offset of the `struct page` from `vmemmap` is the PFN of that
page.

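In code, the vmemmap conversions reduce to pointer arithmetic. The sketch
below is a user-space illustration: an ordinary static array stands in for
the reserved virtual address range, `struct page` is simplified, and the
function names are hypothetical.

```c
/* Simplified stand-in for the kernel's struct page. */
struct page {
	unsigned long flags;
};

/* Assumed: a static array plays the role of the virtually mapped range. */
static struct page vmemmap_backing[4096];
static struct page *vmemmap = vmemmap_backing;

/* The PFN is used directly as an index from the vmemmap base. */
static struct page *vmemmap_pfn_to_page(unsigned long pfn)
{
	return vmemmap + pfn;
}

/* The distance of the page from the vmemmap base is its PFN. */
static unsigned long vmemmap_page_to_pfn(const struct page *page)
{
	return (unsigned long)(page - vmemmap);
}
```

Compared with classic sparse, no section lookup is needed in either
direction, which is the optimization this model aims for.
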
To use vmemmap, an architecture has to reserve a range of virtual
addresses that will map the physical pages containing the memory
map and make sure that `vmemmap` points to that range. In addition,
the architecture should implement the :c:func:`vmemmap_populate` method
that allocates the physical memory and creates page tables for the
virtual memory map. If an architecture does not have any special
requirements for the vmemmap mappings, it can use the default
:c:func:`vmemmap_populate_basepages` provided by the generic memory
management.

The virtually mapped memory map allows storing `struct page` objects
for persistent memory devices in pre-allocated storage on those
devices. This storage is represented by `struct vmem_altmap`,
which is eventually passed to vmemmap_populate() through a long chain
of function calls. The vmemmap_populate() implementation may use the
`vmem_altmap` along with the :c:func:`vmemmap_alloc_block_buf` helper to
allocate the memory map on the persistent memory device.

The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer
`struct page` `mem_map` services for physical address ranges identified
by a device driver. The "device" aspect of `ZONE_DEVICE` relates to the fact
that the page objects for these address ranges are never marked online,
and that a reference must be taken against the device, not just the page,
to keep the memory pinned for active use. `ZONE_DEVICE`, via
:c:func:`devm_memremap_pages`, performs just enough memory hotplug to
turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and
:c:func:`get_user_pages` service for the given range of PFNs. Since the
page reference count never drops below 1, the page is never tracked as
free memory and the page's `struct list_head lru` space is repurposed
for back referencing to the host device / driver that mapped the memory.

While `SPARSEMEM` presents memory as a collection of sections,
optionally collected into memory blocks, `ZONE_DEVICE` users need a
smaller granularity for populating the `mem_map`. Because
`ZONE_DEVICE` memory is never marked online, its memory ranges are never
exposed through the sysfs memory hotplug API on memory block
boundaries. The implementation relies on
this lack of user-API constraint to allow sub-section sized memory
ranges to be specified to :c:func:`arch_add_memory`, the top half of
memory hotplug. Sub-section support allows for 2MB as the cross-arch
common alignment granularity for :c:func:`devm_memremap_pages`.

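The 2MB granularity can be expressed as a simple alignment check on a PFN
range. The sketch below is illustrative only: it assumes 4 KiB pages, and
`subsection_aligned()` is a hypothetical helper, not a kernel function.

```c
#include <stdbool.h>

#define PAGE_SHIFT		12	/* 4 KiB pages, assumed */
#define SUBSECTION_SHIFT	21	/* 2^21 bytes = 2MB */
#define PAGES_PER_SUBSECTION	(1UL << (SUBSECTION_SHIFT - PAGE_SHIFT))

/*
 * Return true when a non-empty PFN range both starts on a 2MB boundary
 * and spans a whole number of 2MB sub-sections.
 */
static bool subsection_aligned(unsigned long start_pfn, unsigned long nr_pages)
{
	unsigned long mask = PAGES_PER_SUBSECTION - 1;

	return nr_pages != 0 && ((start_pfn | nr_pages) & mask) == 0;
}
```

With 4 KiB pages a sub-section is 512 page frames, so a range starting at
PFN 512 with 512 pages passes the check, while one with 256 pages does not.
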
The users of `ZONE_DEVICE` are:

* pmem: Map platform persistent memory to be used as a direct-I/O target

* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()`
  event callbacks to allow a device driver to coordinate memory management
  events related to device memory, typically GPU memory. See
  Documentation/vm/hmm.rst.

* p2pdma: Create `struct page` objects to allow peer devices in a
  PCI/PCIe topology to coordinate direct-DMA operations between themselves,
  i.e. bypass host memory.