Matthew Wilcox (Oracle) [Sat, 29 Jan 2022 15:46:04 +0000 (10:46 -0500)]
mm/damon: Convert damon_pa_young() to use a folio
Ensure that we're passing the entire folio to rmap_walk().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Matthew Wilcox (Oracle) [Sat, 29 Jan 2022 15:46:04 +0000 (10:46 -0500)]
mm/damon: Convert damon_pa_mkold() to use a folio
Ensure that we're passing the entire folio to rmap_walk().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Matthew Wilcox (Oracle) [Sat, 29 Jan 2022 04:32:59 +0000 (23:32 -0500)]
mm/migrate: Convert remove_migration_ptes() to folios
Convert the implementation and all callers.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Matthew Wilcox (Oracle) [Fri, 28 Jan 2022 21:03:42 +0000 (16:03 -0500)]
mm/rmap: Convert make_device_exclusive_range() to use folios
Move the PageTail check earlier so we can avoid even taking the folio
lock on tail pages. Otherwise, this is a straightforward use of
folios throughout.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Matthew Wilcox (Oracle) [Fri, 28 Jan 2022 19:29:43 +0000 (14:29 -0500)]
mm/rmap: Convert try_to_migrate() to folios
Convert the callers to pass a folio and the try_to_migrate_one()
worker to use a folio throughout. Fixes an assumption that a
folio must be <= PMD size.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Matthew Wilcox (Oracle) [Tue, 15 Feb 2022 14:28:49 +0000 (09:28 -0500)]
mm/rmap: Convert try_to_unmap() to take a folio
Change all three callers and the worker function try_to_unmap_one().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Matthew Wilcox (Oracle) [Fri, 21 Jan 2022 15:44:52 +0000 (10:44 -0500)]
mm/huge_memory: Convert __split_huge_pmd() to take a folio
Convert split_huge_pmd_address() at the same time since it only passes
the folio through, and its two callers already have a folio on hand.
Removes numerous calls to compound_head() and removes an assumption
that a page cannot be larger than a PMD.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Matthew Wilcox (Oracle) [Fri, 21 Jan 2022 16:27:31 +0000 (11:27 -0500)]
mm/rmap: Turn page_referenced() into folio_referenced()
Both its callers pass a page which was previously on an LRU list,
so were passing a folio by definition. Use the type system to enforce
that and remove a few calls to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Matthew Wilcox (Oracle) [Tue, 15 Feb 2022 18:33:59 +0000 (13:33 -0500)]
mm/mlock: Add mlock_vma_folio()
Convert mlock_page() into mlock_folio() and convert the callers. Keep
mlock_vma_page() as a wrapper.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Matthew Wilcox (Oracle) [Thu, 20 Jan 2022 23:20:07 +0000 (18:20 -0500)]
mm/rmap: Use a folio in page_mkclean_one()
folio_mkclean() already passes down a head page, so convert it
back to a folio.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Matthew Wilcox (Oracle) [Sat, 29 Jan 2022 20:53:59 +0000 (15:53 -0500)]
mm/page_idle: Convert page_idle_clear_pte_refs() to use a folio
The PG_idle and PG_young bits are ignored if they're set on tail
pages, so ensure we're passing a folio around.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Matthew Wilcox (Oracle) [Thu, 3 Feb 2022 16:40:17 +0000 (11:40 -0500)]
mm: Convert page_vma_mapped_walk to work on PFNs
page_mapped_in_vma() really just wants to walk one page, but as the
code stands, if passed the head page of a compound page, it will
walk every page in the compound page. Extract pfn/nr_pages/pgoff
from the struct page early, so they can be overridden by
page_mapped_in_vma().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Matthew Wilcox (Oracle) [Tue, 15 Feb 2022 22:50:17 +0000 (17:50 -0500)]
sparc32: Add pmd_pfn()
We need to use this function in common code; pull it out of pmd_page().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Matthew Wilcox (Oracle) [Mon, 14 Feb 2022 15:30:07 +0000 (10:30 -0500)]
powerpc: Add pmd_pfn()
This is straightforward for everything except nohash64 where we
indirect through pmd_page(). There must be a better way to do this.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Matthew Wilcox (Oracle) [Mon, 14 Feb 2022 14:59:11 +0000 (09:59 -0500)]
mips: Make pmd_pfn() available in all configurations
Whether or not the platform supports PMD sized pages, we need to
provide pmd_pfn() for an upcoming patch.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Mike Rapoport [Fri, 4 Feb 2022 18:49:20 +0000 (13:49 -0500)]
arch: Add pmd_pfn() where it is missing
We need to use this function in common code, so define it for
architectures and/or configrations that miss it. The result of
pmd_pfn() will only be used if TRANSPARENT_HUGEPAGE is enabled,
but a function or macro called pmd_pfn() must be defined, even
on machines with two level page tables.
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Matthew Wilcox (Oracle) [Thu, 3 Feb 2022 14:06:08 +0000 (09:06 -0500)]
mm: Add DEFINE_PAGE_VMA_WALK and DEFINE_FOLIO_VMA_WALK
Instead of declaring a struct page_vma_mapped_walk directly,
use these helpers to allow us to transition to a PFN approach in the
following patches.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Matthew Wilcox (Oracle) [Thu, 3 Feb 2022 04:29:45 +0000 (23:29 -0500)]
mm: Add folio_pgoff()
This is the folio equivalent of page_to_pgoff().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Matthew Wilcox (Oracle) [Tue, 18 Jan 2022 13:56:47 +0000 (08:56 -0500)]
mm: Add split_folio_to_list()
This is a convenience function; split_huge_page_to_list() can take
any page in a folio (and does so on purpose because that page will
be the one which keeps the refcount). But it's convenient for the
callers to pass the folio instead of the first page in the folio.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Matthew Wilcox (Oracle) [Mon, 17 Jan 2022 21:33:26 +0000 (16:33 -0500)]
mm: Add folio_mapcount()
This implements the same algorithm as total_mapcount(), which is
transformed into a wrapper function.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Matthew Wilcox (Oracle) [Tue, 18 Jan 2022 15:50:48 +0000 (10:50 -0500)]
mm: Turn head_compound_mapcount() into folio_entire_mapcount()
Adjust documentation to be more clear.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Matthew Wilcox (Oracle) [Mon, 17 Jan 2022 19:35:22 +0000 (14:35 -0500)]
mm/vmscan: Turn page_check_dirty_writeback() into folio_check_dirty_writeback()
Saves a few calls to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Matthew Wilcox (Oracle) [Sun, 13 Feb 2022 22:23:58 +0000 (17:23 -0500)]
fs: Move many prototypes to pagemap.h
These functions are page cache functionality and don't need to be
declared in fs.h.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Matthew Wilcox (Oracle) [Sun, 13 Feb 2022 22:22:10 +0000 (17:22 -0500)]
mm/truncate: Combine invalidate_mapping_pagevec() and __invalidate_mapping_pages()
We can save a function call by combining these two functions, which
are identical except for the return value. Also move the prototype
to mm/internal.h.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Matthew Wilcox (Oracle) [Sun, 13 Feb 2022 21:40:24 +0000 (16:40 -0500)]
mm: Turn deactivate_file_page() into deactivate_file_folio()
This function has one caller which already has a reference to the
page, so we don't need to use get_page_unless_zero(). Also move the
prototype to mm/internal.h.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Matthew Wilcox (Oracle) [Sun, 13 Feb 2022 21:38:07 +0000 (16:38 -0500)]
mm/truncate: Convert __invalidate_mapping_pages() to use a folio
Now we can call mapping_evict_folio() instead of invalidate_inode_page()
and save a few calls to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Matthew Wilcox (Oracle) [Sun, 13 Feb 2022 20:22:28 +0000 (15:22 -0500)]
mm/truncate: Split invalidate_inode_page() into mapping_evict_folio()
Some of the callers already have the address_space and can avoid calling
folio_mapping() and checking if the folio was already truncated. Also
add kernel-doc and fix the return type (in case we ever support folios
larger than 4TB).
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Matthew Wilcox (Oracle) [Sun, 13 Feb 2022 03:48:55 +0000 (22:48 -0500)]
mm: Convert remove_mapping() to take a folio
Add kernel-doc and return the number of pages removed in order to
get the statistics right in __invalidate_mapping_pages().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Matthew Wilcox (Oracle) [Sat, 12 Feb 2022 22:43:16 +0000 (17:43 -0500)]
mm/truncate: Replace page_mapped() call in invalidate_inode_page()
folio_mapped() is expensive because it has to check each page's mapcount
field. A cheaper check is whether there are any extra references to
the page, other than the one we own, one from the page private data and
the ones held by the page cache.
The call to remove_mapping() will fail in any case if it cannot freeze
the refcount, but failing here avoids cycling the i_pages spinlock.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Matthew Wilcox (Oracle) [Sat, 12 Feb 2022 22:39:10 +0000 (17:39 -0500)]
mm/truncate: Convert invalidate_inode_page() to use a folio
This saves a number of calls to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Matthew Wilcox (Oracle) [Sat, 12 Feb 2022 20:27:42 +0000 (15:27 -0500)]
mm/truncate: Inline invalidate_complete_page() into its one caller
invalidate_inode_page() is the only caller of invalidate_complete_page()
and inlining it reveals that the first check is unnecessary (because we
hold the page locked, and we just retrieved the mapping from the page).
Actually, it does make a difference, in that tail pages no longer fail
at this check, so it's now possible to remove a tail page from a mapping.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Matthew Wilcox (Oracle) [Sat, 12 Feb 2022 04:39:03 +0000 (23:39 -0500)]
splice: Use a folio in page_cache_pipe_buf_try_steal()
This saves a lot of calls to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Matthew Wilcox (Oracle) [Thu, 23 Dec 2021 21:50:23 +0000 (16:50 -0500)]
mm/vmscan: Convert __remove_mapping() to take a folio
This removes a few hidden calls to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Matthew Wilcox (Oracle) [Fri, 21 Jan 2022 13:41:46 +0000 (08:41 -0500)]
mm: Turn putback_lru_page() into folio_putback_lru()
Add a putback_lru_page() wrapper. Removes a couple of compound_head()
calls.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Matthew Wilcox (Oracle) [Mon, 17 Jan 2022 19:40:12 +0000 (14:40 -0500)]
mm: Add lru_to_folio()
Since page->lru occupies the same bytes as compound_head, any page
on the LRU list must be a folio.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Matthew Wilcox (Oracle) [Tue, 28 Dec 2021 02:11:34 +0000 (21:11 -0500)]
mm/memcg: Convert mem_cgroup_swapout() to take a folio
This removes an assumption that THPs are the only kind of compound
pages and removes a couple of hidden calls to compound_head. It
also documents that you can't pass a tail page to mem_cgroup_swapout().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Matthew Wilcox (Oracle) [Thu, 23 Dec 2021 21:39:05 +0000 (16:39 -0500)]
mm/workingset: Convert workingset_eviction() to take a folio
This removes an assumption that THPs are the only kind of compound
pages and removes a few hidden calls to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Matthew Wilcox (Oracle) [Thu, 17 Feb 2022 17:46:35 +0000 (12:46 -0500)]
mm/gup: Convert check_and_migrate_movable_pages() to use a folio
Switch from head pages to folios. This removes an assumption that
THPs are the only way to have a high-order page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Mon, 21 Mar 2022 16:57:38 +0000 (12:57 -0400)]
mm: Add three folio wrappers
folio_is_zone_device() is equivalent to is_zone_device_page(),
folio_is_device_private() is equivalent to is_device_private_page(),
and folio_is_pinnable() is equivalent to is_pinnable_page().
All of these tests return the same result for every page in the folio,
so we can just pass the head page of the folio to the page variant of
the function.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Matthew Wilcox (Oracle) [Fri, 24 Dec 2021 18:26:22 +0000 (13:26 -0500)]
mm: Turn isolate_lru_page() into folio_isolate_lru()
Add isolate_lru_page() as a wrapper around isolate_lru_folio().
TestClearPageLRU() would have always failed on a tail page, so
returning -EBUSY is the same behaviour.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Thu, 23 Dec 2021 15:20:12 +0000 (10:20 -0500)]
mm/gup: Turn compound_range_next() into gup_folio_range_next()
Convert the only caller to work on folios instead of pages.
This removes the last caller of put_compound_head(), so delete it.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Thu, 23 Dec 2021 04:43:16 +0000 (23:43 -0500)]
mm/gup: Turn compound_next() into gup_folio_next()
Convert both callers to work on folios instead of pages.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Thu, 23 Dec 2021 03:30:29 +0000 (22:30 -0500)]
mm/gup: Convert gup_huge_pgd() to use a folio
Use the new folio-based APIs. This was the last user of
try_grab_compound_head(), so remove it.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Wed, 22 Dec 2021 23:07:47 +0000 (18:07 -0500)]
mm/gup: Convert gup_huge_pud() to use a folio
Use the new folio-based APIs.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Wed, 22 Dec 2021 21:57:23 +0000 (16:57 -0500)]
mm/gup: Convert gup_huge_pmd() to use a folio
Use the new folio-based APIs.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Wed, 22 Dec 2021 21:38:30 +0000 (16:38 -0500)]
mm/gup: Convert gup_hugepte() to use a folio
There should be little to no effect from this patch; just removing
uses of some old APIs.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Fri, 10 Dec 2021 20:54:11 +0000 (15:54 -0500)]
mm/gup: Convert gup_pte_range() to use a folio
We still call try_grab_folio() once per PTE; a future patch could
optimise to just adjust the reference count for each page within
the folio.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Sat, 8 Jan 2022 05:15:04 +0000 (00:15 -0500)]
mm/hugetlb: Use try_grab_folio() instead of try_grab_compound_head()
follow_hugetlb_page() only cares about success or failure, so it doesn't
need to know the type of the returned pointer, only whether it's NULL
or not.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Fri, 10 Dec 2021 20:39:04 +0000 (15:39 -0500)]
mm/gup: Add gup_put_folio()
Convert put_compound_head() to gup_put_folio() and hpage_pincount_sub()
to folio_pincount_sub(). This removes the last call to put_page_refs(),
so delete it. Add a temporary put_compound_head() wrapper which will
be deleted by the end of this series.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Wed, 29 Dec 2021 17:23:55 +0000 (12:23 -0500)]
mm: Remove page_cache_add_speculative() and page_cache_get_speculative()
These wrappers have no more callers, so delete them.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Fri, 4 Feb 2022 15:32:01 +0000 (10:32 -0500)]
mm/gup: Convert try_grab_page() to use a folio
Hoist the folio conversion and the folio_ref_count() check to the
top of the function instead of using the one buried in try_get_page().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Matthew Wilcox (Oracle) [Fri, 4 Feb 2022 15:27:40 +0000 (10:27 -0500)]
mm/gup: Add try_get_folio() and try_grab_folio()
Convert try_get_compound_head() into try_get_folio() and convert
try_grab_compound_head() into try_grab_folio(). Add a temporary
try_grab_compound_head() wrapper around try_grab_folio() to let us
convert callers individually.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Mon, 27 Dec 2021 23:40:41 +0000 (18:40 -0500)]
mm: Turn page_maybe_dma_pinned() into folio_maybe_dma_pinned()
Replace three calls to compound_head() with one. This removes the last
user of compound_pincount(), so remove that helper too.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Mon, 27 Dec 2021 23:28:58 +0000 (18:28 -0500)]
mm: Add folio_pincount_ptr()
This is the folio equivalent of compound_pincount_ptr().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Thu, 6 Jan 2022 21:46:43 +0000 (16:46 -0500)]
mm: Make compound_pincount always available
Move compound_pincount from the third page to the second page, which
means it's available for all compound pages. That lets us delete
hpage_pincount_available().
On 32-bit systems, there isn't enough space for both compound_pincount
and compound_nr in the second page (it would collide with page->private,
which is in use for pages in the swap cache), so revert the optimisation
of storing both compound_order and compound_nr on 32-bit systems.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Fri, 7 Jan 2022 19:19:39 +0000 (14:19 -0500)]
mm/gup: Remove hpage_pincount_sub()
Move the assertion (and correct it to be a cheaper variant),
and inline the atomic_sub() operation.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Fri, 7 Jan 2022 19:15:11 +0000 (14:15 -0500)]
mm/gup: Remove hpage_pincount_add()
It's clearer to call atomic_add() in the callers; the assertions clearly
can't fire there because they're part of the condition for calling
atomic_add().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Matthew Wilcox (Oracle) [Fri, 7 Jan 2022 19:04:55 +0000 (14:04 -0500)]
mm/gup: Handle page split race more efficiently
If we hit the page split race, the current code returns NULL which will
presumably trigger a retry under the mmap_lock. This isn't necessary;
we can just retry the compound_head() lookup. This is a very minor
optimisation of an unlikely path, but conceptually it matches (eg)
the page cache RCU-protected lookup.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Fri, 7 Jan 2022 18:45:25 +0000 (13:45 -0500)]
mm/gup: Remove an assumption of a contiguous memmap
This assumption needs the inverse of nth_page(), which is temporarily
named page_nth() until it's renamed later in this series.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Fri, 7 Jan 2022 18:25:55 +0000 (13:25 -0500)]
mm/gup: Fix some contiguous memmap assumptions
Several functions in gup.c assume that a compound page has virtually
contiguous page structs. This isn't true for SPARSEMEM configs unless
SPARSEMEM_VMEMMAP is also set. Fix them by using nth_page() instead of
plain pointer arithmetic.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Mon, 10 Jan 2022 02:03:47 +0000 (21:03 -0500)]
mm/gup: Change the calling convention for compound_next()
Return the head page instead of storing it to a passed parameter.
Reorder the arguments to match the calling function's arguments.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Sun, 9 Jan 2022 21:21:23 +0000 (16:21 -0500)]
mm/gup: Optimise compound_range_next()
By definition, a compound page has an order >= 1, so the second half
of the test was redundant. Also, this cannot be a tail page since
it's the result of calling compound_head(), so use PageHead() instead
of PageCompound().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Sun, 9 Jan 2022 21:05:11 +0000 (16:05 -0500)]
mm/gup: Change the calling convention for compound_range_next()
Return the head page instead of storing it to a passed parameter.
Pass the start page directly instead of passing a pointer to it.
Reorder the arguments to match the calling function's arguments.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Matthew Wilcox (Oracle) [Sun, 9 Jan 2022 01:23:46 +0000 (20:23 -0500)]
mm/gup: Remove for_each_compound_head()
This macro doesn't simplify the users; it's easier to just call
compound_next() inside a standard loop.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Sun, 9 Jan 2022 01:23:46 +0000 (20:23 -0500)]
mm/gup: Remove for_each_compound_range()
This macro doesn't simplify the users; it's easier to just call
compound_range_next() inside the loop.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Matthew Wilcox (Oracle) [Fri, 4 Feb 2022 14:24:26 +0000 (09:24 -0500)]
mm/gup: Increment the page refcount before the pincount
We should always increase the refcount before doing anything else to
the page so that other page users see the elevated refcount first.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Wed, 16 Feb 2022 04:31:38 +0000 (15:31 +1100)]
mm: build migrate_vma_* for all configs with ZONE_DEVICE support
This code will be used for device coherent memory as well in a bit,
so relax the ifdef a bit.
Link: https://lkml.kernel.org/r/20220210072828.2930359-15-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Christoph Hellwig [Wed, 16 Feb 2022 04:31:38 +0000 (15:31 +1100)]
mm: move the migrate_vma_* device migration code into its own file
Split the code used to migrate to and from ZONE_DEVICE memory from
migrate.c into a new file.
Link: https://lkml.kernel.org/r/20220210072828.2930359-14-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Christoph Hellwig [Wed, 16 Feb 2022 04:31:37 +0000 (15:31 +1100)]
mm: refactor the ZONE_DEVICE handling in migrate_vma_pages
Make the flow a little more clear and prepare for adding a new
ZONE_DEVICE memory type.
Link: https://lkml.kernel.org/r/20220210072828.2930359-13-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Christoph Hellwig [Wed, 16 Feb 2022 04:31:37 +0000 (15:31 +1100)]
mm: refactor the ZONE_DEVICE handling in migrate_vma_insert_page
Make the flow a little more clear and prepare for adding a new
ZONE_DEVICE memory type.
Link: https://lkml.kernel.org/r/20220210072828.2930359-12-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Christoph Hellwig [Wed, 16 Feb 2022 04:31:37 +0000 (15:31 +1100)]
mm: refactor check_and_migrate_movable_pages
Remove up to two levels of indentation by using continue statements
and move variables to local scope where possible.
Link: https://lkml.kernel.org/r/20220210072828.2930359-11-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Christoph Hellwig [Wed, 16 Feb 2022 04:31:37 +0000 (15:31 +1100)]
mm: generalize the pgmap based page_free infrastructure
Key off on the existence of ->page_free to prepare for adding support for
more pgmap types that are device managed and thus need the free callback.
Link: https://lkml.kernel.org/r/20220210072828.2930359-10-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Christoph Hellwig [Wed, 16 Feb 2022 04:31:37 +0000 (15:31 +1100)]
fsdax: depend on ZONE_DEVICE || FS_DAX_LIMITED
Add a depends on ZONE_DEVICE support or the s390-specific limited DAX
support, as one of the two is required at runtime for fsdax code to
actually work.
Link: https://lkml.kernel.org/r/20220210072828.2930359-9-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Christoph Hellwig [Wed, 16 Feb 2022 04:31:36 +0000 (15:31 +1100)]
mm: remove the extra ZONE_DEVICE struct page refcount
ZONE_DEVICE struct pages have an extra reference count that complicates
the code for put_page() and several places in the kernel that need to
check the reference count to see that a page is not being used (gup,
compaction, migration, etc.). Clean up the code so the reference count
doesn't need to be treated specially for ZONE_DEVICE pages.
Note that this excludes the special idle page wakeup for fsdax pages,
which still happens at refcount 1. This is a separate issue and will
be sorted out later. Given that only fsdax pages require the
notifiacation when the refcount hits 1 now, the PAGEMAP_OPS Kconfig
symbol can go away and be replaced with a FS_DAX check for this hook
in the put_page fastpath.
Based on an earlier patch from Ralph Campbell <rcampbell@nvidia.com>.
Link: https://lkml.kernel.org/r/20220210072828.2930359-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Christoph Hellwig [Wed, 16 Feb 2022 04:31:36 +0000 (15:31 +1100)]
mm: don't include <linux/memremap.h> in <linux/mm.h>
Move the check for the actual pgmap types that need the free at refcount
one behavior into the out of line helper, and thus avoid the need to
pull memremap.h into mm.h.
Link: https://lkml.kernel.org/r/20220210072828.2930359-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Christoph Hellwig [Wed, 16 Feb 2022 04:31:35 +0000 (15:31 +1100)]
mm: simplify freeing of devmap managed pages
Make put_devmap_managed_page return if it took charge of the page
or not and remove the separate page_is_devmap_managed helper.
Link: https://lkml.kernel.org/r/20220210072828.2930359-6-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Christoph Hellwig [Wed, 16 Feb 2022 04:31:35 +0000 (15:31 +1100)]
mm: move free_devmap_managed_page to memremap.c
free_devmap_managed_page has nothing to do with the code in swap.c,
move it to live with the rest of the code for devmap handling.
Link: https://lkml.kernel.org/r/20220210072828.2930359-5-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Christoph Hellwig [Wed, 16 Feb 2022 04:31:35 +0000 (15:31 +1100)]
mm: remove pointless includes from <linux/hmm.h>
hmm.h pulls in the world for no good reason at all. Remove the
includes and push a few ones into the users instead.
Link: https://lkml.kernel.org/r/20220210072828.2930359-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Christoph Hellwig [Wed, 16 Feb 2022 04:31:35 +0000 (15:31 +1100)]
mm: remove the __KERNEL__ guard from <linux/mm.h>
__KERNEL__ ifdefs don't make sense outside of include/uapi/.
Link: https://lkml.kernel.org/r/20220210072828.2930359-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Christoph Hellwig [Wed, 16 Feb 2022 04:31:35 +0000 (15:31 +1100)]
mm: remove a pointless CONFIG_ZONE_DEVICE check in memremap_pages
Patch series "start sorting out the ZONE_DEVICE refcount mess", v2.
This series removes the offset by one refcount for ZONE_DEVICE pages that
are freed back to the driver owning them, which is just device private
ones for now, but also the planned device coherent pages and the ehanced
p2p ones pending.
It does not address the fsdax pages yet, which will be attacked in a
follow on series.
This patch (of 27):
memremap.c is only built when CONFIG_ZONE_DEVICE is set, so remove
the superflous extra check.
Link: https://lkml.kernel.org/r/20220210072828.2930359-1-hch@lst.de
Link: https://lkml.kernel.org/r/20220210072828.2930359-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Hugh Dickins [Thu, 3 Mar 2022 01:35:30 +0000 (17:35 -0800)]
mm/munlock: mlock_vma_page() check against VM_SPECIAL
Although mmap_region() and mlock_fixup() take care that VM_LOCKED
is never left set on a VM_SPECIAL vma, there is an interval while
file->f_op->mmap() is using vm_insert_page(s), when VM_LOCKED may
still be set while VM_SPECIAL bits are added: so mlock_vma_page()
should ignore VM_LOCKED while any VM_SPECIAL bits are set.
This showed up as a "Bad page" still mlocked, when vfree()ing pages
which had been vm_inserted by remap_vmalloc_range_partial(): while
release_pages() and __page_cache_release(), and so put_page(), catch
pages still mlocked when freeing (and clear_page_mlock() caught them
when unmapping), the vfree() path is unprepared for them: fix it?
but these pages should not have been mlocked in the first place.
I assume that an mlockall(MCL_FUTURE) had been done in the past; or
maybe the user got to specify MAP_LOCKED on a vmalloc'ing driver mmap.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Hugh Dickins [Tue, 15 Feb 2022 02:42:33 +0000 (18:42 -0800)]
mm/thp: shrink_page_list() avoid splitting VM_LOCKED THP
4.8 commit
7751b2da6be0 ("vmscan: split file huge pages before paging
them out") inserted a split_huge_page_to_list() into shrink_page_list()
without considering the mlock case: no problem if the page has already
been marked as Mlocked (the !page_evictable check much higher up will
have skipped all this), but it has always been the case that races or
omissions in setting Mlocked can rely on page reclaim to detect this
and correct it before actually reclaiming - and that remains so, but
what a shame if a hugepage is needlessly split before discovering it.
It is surprising that page_check_references() returns PAGEREF_RECLAIM
when VM_LOCKED, but there was a good reason for that: try_to_unmap_one()
is where the condition is detected and corrected; and until now it could
not be done in page_referenced_one(), because that does not always have
the page locked. Now that mlock's requirement for page lock has gone,
copy try_to_unmap_one()'s mlock restoration into page_referenced_one(),
and let page_check_references() return PAGEREF_ACTIVATE in this case.
But page_referenced_one() may find a pte mapping one part of a hugepage:
what hold should a pte mapped in a VM_LOCKED area exert over the entire
huge page? That's debatable. The approach taken here is to treat that
pte mapping in page_referenced_one() as if not VM_LOCKED, and if no
VM_LOCKED pmd mapping is found later in the walk, and lack of reference
permits, then PAGEREF_RECLAIM take it to attempted splitting as before.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Hugh Dickins [Tue, 15 Feb 2022 02:40:55 +0000 (18:40 -0800)]
mm/thp: collapse_file() do try_to_unmap(TTU_BATCH_FLUSH)
collapse_file() is using unmap_mapping_pages(1) on each small page found
mapped, unlike others (reclaim, migration, splitting, memory-failure) who
use try_to_unmap(). There are four advantages to try_to_unmap(): first,
its TTU_IGNORE_MLOCK option now avoids leaving mlocked page in pagevec;
second, its vma lookup uses i_mmap_lock_read() not i_mmap_lock_write();
third, it breaks out early if page is not mapped everywhere it might be;
fourth, its TTU_BATCH_FLUSH option can be used, as in page reclaim, to
save up all the TLB flushing until all of the pages have been unmapped.
Wild guess: perhaps it was originally written to use try_to_unmap(),
but hit the VM_BUG_ON_PAGE(page_mapped) after unmapping, because without
TTU_SYNC it may skip page table locks; but unmap_mapping_pages() never
skips them, so fixed the issue. I did once hit that VM_BUG_ON_PAGE()
since making this change: we could pass TTU_SYNC here, but I think just
delete the check - the race is very rare, this is an ordinary small page
so we don't need to be so paranoid about mapcount surprises, and the
page_ref_freeze() just below already handles the case adequately.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Hugh Dickins [Tue, 15 Feb 2022 02:38:47 +0000 (18:38 -0800)]
mm/munlock: page migration needs mlock pagevec drained
Page migration of a VM_LOCKED page tends to fail, because when the old
page is unmapped, it is put on the mlock pagevec with raised refcount,
which then fails the freeze.
At first I thought this would be fixed by a local mlock_page_drain() at
the upper rmap_walk() level - which would have nicely batched all the
munlocks of that page; but tests show that the task can too easily move
to another cpu, leaving pagevec residue behind which fails the migration.
So try_to_migrate_one() drain the local pagevec after page_remove_rmap()
from a VM_LOCKED vma; and do the same in try_to_unmap_one(), whose
TTU_IGNORE_MLOCK users would want the same treatment; and do the same
in remove_migration_pte() - not important when successfully inserting
a new page, but necessary when hoping to retry after failure.
Any new pagevec runs the risk of adding a new way of stranding, and we
might discover other corners where mlock_page_drain() or lru_add_drain()
would now help.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Hugh Dickins [Tue, 15 Feb 2022 02:37:29 +0000 (18:37 -0800)]
mm/munlock: mlock_page() munlock_page() batch by pagevec
A weakness of the page->mlock_count approach is the need for lruvec lock
while holding page table lock. That is not an overhead we would allow on
normal pages, but I think acceptable just for pages in an mlocked area.
But let's try to amortize the extra cost by gathering on per-cpu pagevec
before acquiring the lruvec lock.
I have an unverified conjecture that the mlock pagevec might work out
well for delaying the mlock processing of new file pages until they have
got off lru_cache_add()'s pagevec and on to LRU.
The initialization of page->mlock_count is subject to races and awkward:
0 or !!PageMlocked or 1? Was it wrong even in the implementation before
this commit, which just widens the window? I haven't gone back to think
it through. Maybe someone can point out a better way to initialize it.
Bringing lru_cache_add_inactive_or_unevictable()'s mlock initialization
into mm/mlock.c has helped: mlock_new_page(), using the mlock pagevec,
rather than lru_cache_add()'s pagevec.
Experimented with various orderings: the right thing seems to be for
mlock_page() and mlock_new_page() to TestSetPageMlocked before adding to
pagevec, but munlock_page() to leave TestClearPageMlocked to the later
pagevec processing.
Dropped the VM_BUG_ON_PAGE(PageTail)s this time around: they have made
their point, and the thp_nr_page()s already contain a VM_BUG_ON_PGFLAGS()
for that.
This still leaves acquiring lruvec locks under page table lock each time
the pagevec fills (or a THP is added): which I suppose is rather silly,
since they sit on pagevec waiting to be processed long after page table
lock has been dropped; but I'm disinclined to uglify the calling sequence
until some load shows an actual problem with it (nothing wrong with
taking lruvec lock under page table lock, just "nicer" to do it less).
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Hugh Dickins [Tue, 15 Feb 2022 02:34:46 +0000 (18:34 -0800)]
mm/munlock: delete smp_mb() from __pagevec_lru_add_fn()
My reading of comment on smp_mb__after_atomic() in __pagevec_lru_add_fn()
says that it can now be deleted; and that remains so when the next patch
is added.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Hugh Dickins [Tue, 15 Feb 2022 02:33:17 +0000 (18:33 -0800)]
mm/migrate: __unmap_and_move() push good newpage to LRU
Compaction, NUMA page movement, THP collapse/split, and memory failure
do isolate unevictable pages from their "LRU", losing the record of
mlock_count in doing so (isolators are likely to use page->lru for their
own private lists, so mlock_count has to be presumed lost).
That's unfortunate, and we should put in some work to correct that: one
can imagine a function to build up the mlock_count again - but it would
require i_mmap_rwsem for read, so be careful where it's called. Or
page_referenced_one() and try_to_unmap_one() might do that extra work.
But one place that can very easily be improved is page migration's
__unmap_and_move(): a small adjustment to where the successful new page
is put back on LRU, and its mlock_count (if any) is built back up by
remove_migration_ptes().
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Hugh Dickins [Tue, 15 Feb 2022 02:31:48 +0000 (18:31 -0800)]
mm/munlock: mlock_pte_range() when mlocking or munlocking
Fill in missing pieces: reimplementation of munlock_vma_pages_range(),
required to lower the mlock_counts when munlocking without munmapping;
and its complement, implementation of mlock_vma_pages_range(), required
to raise the mlock_counts on pages already there when a range is mlocked.
Combine them into just the one function mlock_vma_pages_range(), using
walk_page_range() to run mlock_pte_range(). This approach fixes the
"Very slow unlockall()" of unpopulated PROT_NONE areas, reported in
https://lore.kernel.org/linux-mm/
70885d37-62b7-748b-29df-
9e94f3291736@gmail.com/
Munlock clears VM_LOCKED at the start, under exclusive mmap_lock; but if
a racing truncate or holepunch (depending on i_mmap_rwsem) gets to the
pte first, it will not try to munlock the page: leaving release_pages()
to correct it when the last reference to the page is gone - that's okay,
a page is not evictable anyway while it is held by an extra reference.
Mlock sets VM_LOCKED at the start, under exclusive mmap_lock; but if
a racing remove_migration_pte() or try_to_unmap_one() (depending on
i_mmap_rwsem) gets to the pte first, it will try to mlock the page,
then mlock_pte_range() mlock it a second time. This is harder to
reproduce, but a more serious race because it could leave the page
unevictable indefinitely though the area is munlocked afterwards.
Guard against it by setting the (inappropriate) VM_IO flag,
and modifying mlock_vma_page() to decline such vmas.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Hugh Dickins [Tue, 15 Feb 2022 02:29:54 +0000 (18:29 -0800)]
mm/munlock: maintain page->mlock_count while unevictable
Previous patches have been preparatory: now implement page->mlock_count.
The ordering of the "Unevictable LRU" is of no significance, and there is
no point holding unevictable pages on a list: place page->mlock_count to
overlay page->lru.prev (since page->lru.next is overlaid by compound_head,
which needs to be even so as not to satisfy PageTail - though 2 could be
added instead of 1 for each mlock, if that's ever an improvement).
But it's only safe to rely on or modify page->mlock_count while lruvec
lock is held and page is on unevictable "LRU" - we can save lots of edits
by continuing to pretend that there's an imaginary LRU here (there is an
unevictable count which still needs to be maintained, but not a list).
The mlock_count technique suffers from an unreliability much like with
page_mlock(): while someone else has the page off LRU, not much can
be done. As before, err on the safe side (behave as if mlock_count 0),
and let try_to_unlock_one() move the page to unevictable if reclaim finds
out later on - a few misplaced pages don't matter, what we want to avoid
is imbalancing reclaim by flooding evictable lists with unevictable pages.
I am not a fan of "if (!isolate_lru_page(page)) putback_lru_page(page);":
if we have taken lruvec lock to get the page off its present list, then
we save everyone trouble (and however many extra atomic ops) by putting
it on its destination list immediately.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Hugh Dickins [Tue, 15 Feb 2022 02:28:05 +0000 (18:28 -0800)]
mm/munlock: replace clear_page_mlock() by final clearance
Placing munlock_vma_page() at the end of page_remove_rmap() shifts most
of the munlocking to clear_page_mlock(), since PageMlocked is typically
still set when mapcount has fallen to 0. That is not what we want: we
want /proc/vmstat's unevictable_pgs_cleared to remain as a useful check
on the integrity of of the mlock/munlock protocol - small numbers are
not surprising, but big numbers mean the protocol is not working.
That could be easily fixed by placing munlock_vma_page() at the start of
page_remove_rmap(); but later in the series we shall want to batch the
munlocking, and that too would tend to leave PageMlocked still set at
the point when it is checked.
So delete clear_page_mlock() now: leave it instead to release_pages()
(and __page_cache_release()) to do this backstop clearing of Mlocked,
when page refcount has fallen to 0. If a pinned page occasionally gets
counted as Mlocked and Unevictable until it is unpinned, that's okay.
A slightly regrettable side-effect of this change is that, since
release_pages() and __page_cache_release() may be called at interrupt
time, those places which update NR_MLOCK with interrupts enabled
had better use mod_zone_page_state() than __mod_zone_page_state()
(but holding the lruvec lock always has interrupts disabled).
This change, forcing Mlocked off when refcount 0 instead of earlier
when mapcount 0, is not fundamental: it can be reversed if performance
or something else is found to suffer; but this is the easiest way to
separate the stats - let's not complicate that without good reason.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Hugh Dickins [Tue, 15 Feb 2022 02:26:39 +0000 (18:26 -0800)]
mm/munlock: rmap call mlock_vma_page() munlock_vma_page()
Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
inline functions which check (vma->vm_flags & VM_LOCKED) before calling
mlock_page() and munlock_page() in mm/mlock.c.
Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
because we have understandable difficulty in accounting pte maps of THPs,
and if passed a PageHead page, mlock_page() and munlock_page() cannot
tell whether it's a pmd map to be counted or a pte map to be ignored.
Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
others, and use that to call mlock_vma_page() at the end of the page
adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
beginning? unimportant, but end was easier for assertions in testing).
No page lock is required (although almost all adds happen to hold it):
delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
Certainly page lock did serialize with page migration, but I'm having
difficulty explaining why that was ever important.
Mlock accounting on THPs has been hard to define, differed between anon
and file, involved PageDoubleMap in some places and not others, required
clear_page_mlock() at some points. Keep it simple now: just count the
pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.
page_add_new_anon_rmap() callers unchanged: they have long been calling
lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
handling (it also checks for not VM_SPECIAL: I think that's overcautious,
and inconsistent with other checks, that mmap_region() already prevents
VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it).
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Hugh Dickins [Tue, 15 Feb 2022 02:23:29 +0000 (18:23 -0800)]
mm/munlock: delete munlock_vma_pages_all(), allow oomreap
munlock_vma_pages_range() will still be required, when munlocking but
not munmapping a set of pages; but when unmapping a pte, the mlock count
will be maintained in much the same way as it will be maintained when
mapping in the pte. Which removes the need for munlock_vma_pages_all()
on mlocked vmas when munmapping or exiting: eliminating the catastrophic
contention on i_mmap_rwsem, and the need for page lock on the pages.
There is still a need to update locked_vm accounting according to the
munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped().
exit_mmap() does not need locked_vm updates, so delete unlock_range().
And wasn't I the one who forbade the OOM reaper to attack mlocked vmas,
because of the uncertainty in blocking on all those page locks?
No fear of that now, so permit the OOM reaper on mlocked vmas.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Hugh Dickins [Tue, 15 Feb 2022 02:21:52 +0000 (18:21 -0800)]
mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE
If counting page mlocks, we must not double-count: follow_page_pte() can
tell if a page has already been Mlocked or not, but cannot tell if a pte
has already been counted or not: that will have to be done when the pte
is mapped in (which lru_cache_add_inactive_or_unevictable() already tracks
for new anon pages, but there's no such tracking yet for others).
Delete all the FOLL_MLOCK code - faulting in the missing pages will do
all that is necessary, without special mlock_vma_page() calls from here.
But then FOLL_POPULATE turns out to serve no purpose - it was there so
that its absence would tell faultin_page() not to faultin page when
setting up VM_LOCKONFAULT areas; but if there's no special work needed
here for mlock, then there's no work at all here for VM_LOCKONFAULT.
Have I got that right? I've not looked into the history, but see that
FOLL_POPULATE goes back before VM_LOCKONFAULT: did it serve a different
purpose before? Ah, yes, it was used to skip the old stack guard page.
And is it intentional that COW is not broken on existing pages when
setting up a VM_LOCKONFAULT area? I can see that being argued either
way, and have no reason to disagree with current behaviour.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Hugh Dickins [Tue, 15 Feb 2022 02:20:24 +0000 (18:20 -0800)]
mm/munlock: delete page_mlock() and all its works
We have recommended some applications to mlock their userspace, but that
turns out to be counter-productive: when many processes mlock the same
file, contention on rmap's i_mmap_rwsem can become intolerable at exit: it
is needed for write, to remove any vma mapping that file from rmap's tree;
but hogged for read by those with mlocks calling page_mlock() (formerly
known as try_to_munlock()) on *each* page mapped from the file (the
purpose being to find out whether another process has the page mlocked,
so therefore it should not be unmlocked yet).
Several optimizations have been made in the past: one is to skip
page_mlock() when mapcount tells that nothing else has this page
mapped; but that doesn't help at all when others do have it mapped.
This time around, I initially intended to add a preliminary search
of the rmap tree for overlapping VM_LOCKED ranges; but that gets
messy with locking order, when in doubt whether a page is actually
present; and risks adding even more contention on the i_mmap_rwsem.
A solution would be much easier, if only there were space in struct page
for an mlock_count... but actually, most of the time, there is space for
it - an mlocked page spends most of its life on an unevictable LRU, but
since 3.18 removed the scan_unevictable_pages sysctl, that "LRU" has
been redundant. Let's try to reuse its page->lru.
But leave that until a later patch: in this patch, clear the ground by
removing page_mlock(), and all the infrastructure that has gathered
around it - which mostly hinders understanding, and will make reviewing
new additions harder. Don't mind those old comments about THPs, they
date from before 4.5's refcounting rework: splitting is not a risk here.
Just keep a minimal version of munlock_vma_page(), as reminder of what it
should attend to (in particular, the odd way PGSTRANDED is counted out of
PGMUNLOCKED), and likewise a stub for munlock_vma_pages_range(). Move
unchanged __mlock_posix_error_return() out of the way, down to above its
caller: this series then makes no further change after mlock_fixup().
After this and each following commit, the kernel builds, boots and runs;
but with deficiencies which may show up in testing of mlock and munlock.
The system calls succeed or fail as before, and mlock remains effective
in preventing page reclaim; but meminfo's Unevictable and Mlocked amounts
may be shown too low after mlock, grow, then stay too high after munlock:
with previously mlocked pages remaining unevictable for too long, until
finally unmapped and freed and counts corrected. Normal service will be
resumed in "mm/munlock: mlock_pte_range() when mlocking or munlocking".
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Linus Torvalds [Wed, 16 Feb 2022 20:09:22 +0000 (12:09 -0800)]
Merge tag 'mmc-v5.17-rc1-2' of git://git./linux/kernel/git/ulfh/mmc
Pull MMC fix from Ulf Hansson:
"Fix recovery logic for multi block I/O reads (MMC_READ_MULTIPLE_BLOCK)"
* tag 'mmc-v5.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: block: fix read single on recovery logic
Linus Torvalds [Tue, 15 Feb 2022 23:28:00 +0000 (15:28 -0800)]
tty: n_tty: do not look ahead for EOL character past the end of the buffer
Daniel Gibson reports that the n_tty code gets line termination wrong in
very specific cases:
"If you feed a line with exactly 64 chars + terminating newline, and
directly afterwards (without reading) another line into a pseudo
terminal, the the first read() on the other side will return the 64
char line *without* terminating newline, and the next read() will
return the missing terminating newline AND the complete next line (if
it fits in the buffer)"
and bisected the behavior to commit
3b830a9c34d5 ("tty: convert
tty_ldisc_ops 'read()' function to take a kernel pointer").
Now, digging deeper, it turns out that the behavior isn't exactly new:
what changed in commit
3b830a9c34d5 was that the tty line discipline
.read() function is now passed an intermediate kernel buffer rather than
the final user space buffer.
And that intermediate kernel buffer is 64 bytes in size - thus that
special case with exactly 64 bytes plus terminating newline.
The same problem did exist before, but historically the boundary was not
the 64-byte chunk, but the user-supplied buffer size, which is obviously
generally bigger (and potentially bigger than N_TTY_BUF_SIZE, which
would hide the issue entirely).
The reason is that the n_tty canon_copy_from_read_buf() code would look
ahead for the EOL character one byte further than it would actually
copy. It would then decide that it had found the terminator, and unmark
it as an EOL character - which in turn explains why the next read
wouldn't then be terminated by it.
Now, the reason it did all this in the first place is related to some
historical and pretty obscure EOF behavior, see commit
ac8f3bf8832a
("n_tty: Fix poll() after buffer-limited eof push read") and commit
40d5e0905a03 ("n_tty: Fix EOF push handling").
And the reason for the EOL confusion is that we treat EOF as a special
EOL condition, with the EOL character being NUL (aka "__DISABLED_CHAR"
in the kernel sources).
So that EOF look-ahead also affects the normal EOL handling.
This patch just removes the look-ahead that causes problems, because EOL
is much more critical than the historical "EOF in the middle of a line
that coincides with the end of the buffer" handling ever was.
Now, it is possible that we should indeed re-introduce the "look at next
character to see if it's a EOF" behavior, but if so, that should be done
not at the kernel buffer chunk boundary in canon_copy_from_read_buf(),
but at a higher level, when we run out of the user buffer.
In particular, the place to do that would be at the top of
'n_tty_read()', where we check if it's a continuation of a previously
started read, and there is no more buffer space left, we could decide to
just eat the __DISABLED_CHAR at that point.
But that would be a separate patch, because I suspect nobody actually
cares, and I'd like to get a report about it before bothering.
Fixes:
3b830a9c34d5 ("tty: convert tty_ldisc_ops 'read()' function to take a kernel pointer")
Fixes:
ac8f3bf8832a ("n_tty: Fix poll() after buffer-limited eof push read")
Fixes:
40d5e0905a03 ("n_tty: Fix EOF push handling")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215611
Reported-and-tested-by: Daniel Gibson <metalcaedes@gmail.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Tue, 15 Feb 2022 19:07:59 +0000 (11:07 -0800)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Read HW interrupt pending state from the HW
x86:
- Don't truncate the performance event mask on AMD
- Fix Xen runstate updates to be atomic when preempting vCPU
- Fix for AMD AVIC interrupt injection race
- Several other AMD fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86/pmu: Use AMD64_RAW_EVENT_MASK for PERF_TYPE_RAW
KVM: x86/pmu: Don't truncate the PerfEvtSeln MSR when creating a perf event
KVM: SVM: fix race between interrupt delivery and AVIC inhibition
KVM: SVM: set IRR in svm_deliver_interrupt
KVM: SVM: extract avic_ring_doorbell
selftests: kvm: Remove absent target file
KVM: arm64: vgic: Read HW interrupt pending state from the HW
KVM: x86/xen: Fix runstate updates to be atomic when preempting vCPU
KVM: x86: SVM: move avic definitions from AMD's spec to svm.h
KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it
KVM: x86: nSVM: deal with L1 hypervisor that intercepts interrupts but lets L2 control them
KVM: x86: nSVM: expose clean bit support to the guest
KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM
KVM: x86: nSVM: mark vmcb01 as dirty when restoring SMM saved state
KVM: x86: nSVM: fix potential NULL derefernce on nested migration
KVM: x86: SVM: don't passthrough SMAP/SMEP/PKE bits in !NPT && !gCR0.PG case
Revert "svm: Add warning message for AVIC IPI invalid target"
Linus Torvalds [Tue, 15 Feb 2022 18:52:05 +0000 (10:52 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/hid/hid
Pull HID fixes from Jiri Kosina:
- memory leak fix for hid-elo driver (Dongliang Mu)
- fix for hangs on newer AMD platforms with amd_sfh-driven hardware
(Basavaraj Natikar )
- locking fix in i2c-hid (Daniel Thompson)
- a few device-ID specific quirks
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
HID: amd_sfh: Add interrupt handler to process interrupts
HID: amd_sfh: Add functionality to clear interrupts
HID: amd_sfh: Disable the interrupt for all command
HID: amd_sfh: Correct the structure field name
HID: amd_sfh: Handle amd_sfh work buffer in PM ops
HID:Add support for UGTABLET WP5540
HID: amd_sfh: Add illuminance mask to limit ALS max value
HID: amd_sfh: Increase sensor command timeout
HID: i2c-hid: goodix: Fix a lockdep splat
HID: elo: fix memory leak in elo_probe
HID: apple: Set the tilde quirk flag on the Wellspring 5 and later
Linus Torvalds [Tue, 15 Feb 2022 17:14:05 +0000 (09:14 -0800)]
Merge tag 'for-5.17-rc4-tag' of git://git./linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- yield CPU more often when defragmenting a large file
- skip defragmenting extents already under writeback
- improve error message when send fails to write file data
- get rid of warning when mounted with 'flushoncommit'
* tag 'for-5.17-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: send: in case of IO error log it
btrfs: get rid of warning on transaction commit when using flushoncommit
btrfs: defrag: don't try to defrag extents which are under writeback
btrfs: don't hold CPU for too long when defragging a file
Linus Torvalds [Tue, 15 Feb 2022 17:10:09 +0000 (09:10 -0800)]
Merge tag 'for-5.17/parisc-3' of git://git./linux/kernel/git/deller/parisc-linux
Pull parisc architecture fixes from Helge Deller:
- Fix miscompilations when function calls are made from inside a
put_user() call
- Drop __init from map_pages() declaration to avoid random boot crashes
- Added #error messages if a 64-bit compiler was used to build a 32-bit
kernel (and vice versa)
- Fix out-of-bound data TLB miss faults in sba_iommu and ccio-dma
drivers
- Add ioread64_lo_hi() and iowrite64_lo_hi() functions to avoid kernel
test robot errors
- Fix link failure when 8250_gsc driver is built without CONFIG_IOSAPIC
* tag 'for-5.17/parisc-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
serial: parisc: GSC: fix build when IOSAPIC is not set
parisc: Fix some apparent put_user() failures
parisc: Show error if wrong 32/64-bit compiler is being used
parisc: Add ioread64_lo_hi() and iowrite64_lo_hi()
parisc: Fix sglist access in ccio-dma.c
parisc: Fix data TLB miss in sba_unmap_sg
parisc: Drop __init from map_pages declaration