platform/kernel/linux-starfive.git
2 years agomm/gup: Add try_get_folio() and try_grab_folio()
Matthew Wilcox (Oracle) [Fri, 4 Feb 2022 15:27:40 +0000 (10:27 -0500)]
mm/gup: Add try_get_folio() and try_grab_folio()

Convert try_get_compound_head() into try_get_folio() and convert
try_grab_compound_head() into try_grab_folio().  Add a temporary
try_grab_compound_head() wrapper around try_grab_folio() to let us
convert callers individually.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2 years agomm: Turn page_maybe_dma_pinned() into folio_maybe_dma_pinned()
Matthew Wilcox (Oracle) [Mon, 27 Dec 2021 23:40:41 +0000 (18:40 -0500)]
mm: Turn page_maybe_dma_pinned() into folio_maybe_dma_pinned()

Replace three calls to compound_head() with one.  This removes the last
user of compound_pincount(), so remove that helper too.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2 years agomm: Add folio_pincount_ptr()
Matthew Wilcox (Oracle) [Mon, 27 Dec 2021 23:28:58 +0000 (18:28 -0500)]
mm: Add folio_pincount_ptr()

This is the folio equivalent of compound_pincount_ptr().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2 years agomm: Make compound_pincount always available
Matthew Wilcox (Oracle) [Thu, 6 Jan 2022 21:46:43 +0000 (16:46 -0500)]
mm: Make compound_pincount always available

Move compound_pincount from the third page to the second page, which
means it's available for all compound pages.  That lets us delete
hpage_pincount_available().

On 32-bit systems, there isn't enough space for both compound_pincount
and compound_nr in the second page (it would collide with page->private,
which is in use for pages in the swap cache), so revert the optimisation
of storing both compound_order and compound_nr on 32-bit systems.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2 years agomm/gup: Remove hpage_pincount_sub()
Matthew Wilcox (Oracle) [Fri, 7 Jan 2022 19:19:39 +0000 (14:19 -0500)]
mm/gup: Remove hpage_pincount_sub()

Move the assertion (and correct it to be a cheaper variant),
and inline the atomic_sub() operation.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2 years agomm/gup: Remove hpage_pincount_add()
Matthew Wilcox (Oracle) [Fri, 7 Jan 2022 19:15:11 +0000 (14:15 -0500)]
mm/gup: Remove hpage_pincount_add()

It's clearer to call atomic_add() in the callers; the assertions clearly
can't fire there because they're part of the condition for calling
atomic_add().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2 years agomm/gup: Handle page split race more efficiently
Matthew Wilcox (Oracle) [Fri, 7 Jan 2022 19:04:55 +0000 (14:04 -0500)]
mm/gup: Handle page split race more efficiently

If we hit the page split race, the current code returns NULL which will
presumably trigger a retry under the mmap_lock.  This isn't necessary;
we can just retry the compound_head() lookup.  This is a very minor
optimisation of an unlikely path, but conceptually it matches (eg)
the page cache RCU-protected lookup.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2 years agomm/gup: Remove an assumption of a contiguous memmap
Matthew Wilcox (Oracle) [Fri, 7 Jan 2022 18:45:25 +0000 (13:45 -0500)]
mm/gup: Remove an assumption of a contiguous memmap

This assumption needs the inverse of nth_page(), which is temporarily
named page_nth() until it's renamed later in this series.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2 years agomm/gup: Fix some contiguous memmap assumptions
Matthew Wilcox (Oracle) [Fri, 7 Jan 2022 18:25:55 +0000 (13:25 -0500)]
mm/gup: Fix some contiguous memmap assumptions

Several functions in gup.c assume that a compound page has virtually
contiguous page structs.  This isn't true for SPARSEMEM configs unless
SPARSEMEM_VMEMMAP is also set.  Fix them by using nth_page() instead of
plain pointer arithmetic.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2 years agomm/gup: Change the calling convention for compound_next()
Matthew Wilcox (Oracle) [Mon, 10 Jan 2022 02:03:47 +0000 (21:03 -0500)]
mm/gup: Change the calling convention for compound_next()

Return the head page instead of storing it to a passed parameter.
Reorder the arguments to match the calling function's arguments.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2 years agomm/gup: Optimise compound_range_next()
Matthew Wilcox (Oracle) [Sun, 9 Jan 2022 21:21:23 +0000 (16:21 -0500)]
mm/gup: Optimise compound_range_next()

By definition, a compound page has an order >= 1, so the second half
of the test was redundant.  Also, this cannot be a tail page since
it's the result of calling compound_head(), so use PageHead() instead
of PageCompound().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2 years agomm/gup: Change the calling convention for compound_range_next()
Matthew Wilcox (Oracle) [Sun, 9 Jan 2022 21:05:11 +0000 (16:05 -0500)]
mm/gup: Change the calling convention for compound_range_next()

Return the head page instead of storing it to a passed parameter.
Pass the start page directly instead of passing a pointer to it.
Reorder the arguments to match the calling function's arguments.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2 years agomm/gup: Remove for_each_compound_head()
Matthew Wilcox (Oracle) [Sun, 9 Jan 2022 01:23:46 +0000 (20:23 -0500)]
mm/gup: Remove for_each_compound_head()

This macro doesn't simplify the users; it's easier to just call
compound_next() inside a standard loop.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2 years agomm/gup: Remove for_each_compound_range()
Matthew Wilcox (Oracle) [Sun, 9 Jan 2022 01:23:46 +0000 (20:23 -0500)]
mm/gup: Remove for_each_compound_range()

This macro doesn't simplify the users; it's easier to just call
compound_range_next() inside the loop.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2 years agomm/gup: Increment the page refcount before the pincount
Matthew Wilcox (Oracle) [Fri, 4 Feb 2022 14:24:26 +0000 (09:24 -0500)]
mm/gup: Increment the page refcount before the pincount

We should always increase the refcount before doing anything else to
the page so that other page users see the elevated refcount first.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2 years agomm: build migrate_vma_* for all configs with ZONE_DEVICE support
Christoph Hellwig [Wed, 16 Feb 2022 04:31:38 +0000 (15:31 +1100)]
mm: build migrate_vma_* for all configs with ZONE_DEVICE support

This code will be used for device coherent memory as well in a bit,
so relax the ifdef a bit.

Link: https://lkml.kernel.org/r/20220210072828.2930359-15-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: move the migrate_vma_* device migration code into its own file
Christoph Hellwig [Wed, 16 Feb 2022 04:31:38 +0000 (15:31 +1100)]
mm: move the migrate_vma_* device migration code into its own file

Split the code used to migrate to and from ZONE_DEVICE memory from
migrate.c into a new file.

Link: https://lkml.kernel.org/r/20220210072828.2930359-14-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: refactor the ZONE_DEVICE handling in migrate_vma_pages
Christoph Hellwig [Wed, 16 Feb 2022 04:31:37 +0000 (15:31 +1100)]
mm: refactor the ZONE_DEVICE handling in migrate_vma_pages

Make the flow a little more clear and prepare for adding a new
ZONE_DEVICE memory type.

Link: https://lkml.kernel.org/r/20220210072828.2930359-13-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: refactor the ZONE_DEVICE handling in migrate_vma_insert_page
Christoph Hellwig [Wed, 16 Feb 2022 04:31:37 +0000 (15:31 +1100)]
mm: refactor the ZONE_DEVICE handling in migrate_vma_insert_page

Make the flow a little more clear and prepare for adding a new
ZONE_DEVICE memory type.

Link: https://lkml.kernel.org/r/20220210072828.2930359-12-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: refactor check_and_migrate_movable_pages
Christoph Hellwig [Wed, 16 Feb 2022 04:31:37 +0000 (15:31 +1100)]
mm: refactor check_and_migrate_movable_pages

Remove up to two levels of indentation by using continue statements
and move variables to local scope where possible.

Link: https://lkml.kernel.org/r/20220210072828.2930359-11-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: generalize the pgmap based page_free infrastructure
Christoph Hellwig [Wed, 16 Feb 2022 04:31:37 +0000 (15:31 +1100)]
mm: generalize the pgmap based page_free infrastructure

Key off on the existence of ->page_free to prepare for adding support for
more pgmap types that are device managed and thus need the free callback.

Link: https://lkml.kernel.org/r/20220210072828.2930359-10-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agofsdax: depend on ZONE_DEVICE || FS_DAX_LIMITED
Christoph Hellwig [Wed, 16 Feb 2022 04:31:37 +0000 (15:31 +1100)]
fsdax: depend on ZONE_DEVICE || FS_DAX_LIMITED

Add a depends on ZONE_DEVICE support or the s390-specific limited DAX
support, as one of the two is required at runtime for fsdax code to
actually work.

Link: https://lkml.kernel.org/r/20220210072828.2930359-9-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: remove the extra ZONE_DEVICE struct page refcount
Christoph Hellwig [Wed, 16 Feb 2022 04:31:36 +0000 (15:31 +1100)]
mm: remove the extra ZONE_DEVICE struct page refcount

ZONE_DEVICE struct pages have an extra reference count that complicates
the code for put_page() and several places in the kernel that need to
check the reference count to see that a page is not being used (gup,
compaction, migration, etc.). Clean up the code so the reference count
doesn't need to be treated specially for ZONE_DEVICE pages.

Note that this excludes the special idle page wakeup for fsdax pages,
which still happens at refcount 1.  This is a separate issue and will
be sorted out later.  Given that only fsdax pages require the
notifiacation when the refcount hits 1 now, the PAGEMAP_OPS Kconfig
symbol can go away and be replaced with a FS_DAX check for this hook
in the put_page fastpath.

Based on an earlier patch from Ralph Campbell <rcampbell@nvidia.com>.

Link: https://lkml.kernel.org/r/20220210072828.2930359-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: don't include <linux/memremap.h> in <linux/mm.h>
Christoph Hellwig [Wed, 16 Feb 2022 04:31:36 +0000 (15:31 +1100)]
mm: don't include <linux/memremap.h> in <linux/mm.h>

Move the check for the actual pgmap types that need the free at refcount
one behavior into the out of line helper, and thus avoid the need to
pull memremap.h into mm.h.

Link: https://lkml.kernel.org/r/20220210072828.2930359-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: simplify freeing of devmap managed pages
Christoph Hellwig [Wed, 16 Feb 2022 04:31:35 +0000 (15:31 +1100)]
mm: simplify freeing of devmap managed pages

Make put_devmap_managed_page return if it took charge of the page
or not and remove the separate page_is_devmap_managed helper.

Link: https://lkml.kernel.org/r/20220210072828.2930359-6-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: move free_devmap_managed_page to memremap.c
Christoph Hellwig [Wed, 16 Feb 2022 04:31:35 +0000 (15:31 +1100)]
mm: move free_devmap_managed_page to memremap.c

free_devmap_managed_page has nothing to do with the code in swap.c,
move it to live with the rest of the code for devmap handling.

Link: https://lkml.kernel.org/r/20220210072828.2930359-5-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: remove pointless includes from <linux/hmm.h>
Christoph Hellwig [Wed, 16 Feb 2022 04:31:35 +0000 (15:31 +1100)]
mm: remove pointless includes from <linux/hmm.h>

hmm.h pulls in the world for no good reason at all.  Remove the
includes and push a few ones into the users instead.

Link: https://lkml.kernel.org/r/20220210072828.2930359-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: remove the __KERNEL__ guard from <linux/mm.h>
Christoph Hellwig [Wed, 16 Feb 2022 04:31:35 +0000 (15:31 +1100)]
mm: remove the __KERNEL__ guard from <linux/mm.h>

__KERNEL__ ifdefs don't make sense outside of include/uapi/.

Link: https://lkml.kernel.org/r/20220210072828.2930359-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: remove a pointless CONFIG_ZONE_DEVICE check in memremap_pages
Christoph Hellwig [Wed, 16 Feb 2022 04:31:35 +0000 (15:31 +1100)]
mm: remove a pointless CONFIG_ZONE_DEVICE check in memremap_pages

Patch series "start sorting out the ZONE_DEVICE refcount mess", v2.

This series removes the offset by one refcount for ZONE_DEVICE pages that
are freed back to the driver owning them, which is just device private
ones for now, but also the planned device coherent pages and the ehanced
p2p ones pending.

It does not address the fsdax pages yet, which will be attacked in a
follow on series.

This patch (of 27):

memremap.c is only built when CONFIG_ZONE_DEVICE is set, so remove
the superflous extra check.

Link: https://lkml.kernel.org/r/20220210072828.2930359-1-hch@lst.de
Link: https://lkml.kernel.org/r/20220210072828.2930359-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/munlock: mlock_vma_page() check against VM_SPECIAL
Hugh Dickins [Thu, 3 Mar 2022 01:35:30 +0000 (17:35 -0800)]
mm/munlock: mlock_vma_page() check against VM_SPECIAL

Although mmap_region() and mlock_fixup() take care that VM_LOCKED
is never left set on a VM_SPECIAL vma, there is an interval while
file->f_op->mmap() is using vm_insert_page(s), when VM_LOCKED may
still be set while VM_SPECIAL bits are added: so mlock_vma_page()
should ignore VM_LOCKED while any VM_SPECIAL bits are set.

This showed up as a "Bad page" still mlocked, when vfree()ing pages
which had been vm_inserted by remap_vmalloc_range_partial(): while
release_pages() and __page_cache_release(), and so put_page(), catch
pages still mlocked when freeing (and clear_page_mlock() caught them
when unmapping), the vfree() path is unprepared for them: fix it?
but these pages should not have been mlocked in the first place.

I assume that an mlockall(MCL_FUTURE) had been done in the past; or
maybe the user got to specify MAP_LOCKED on a vmalloc'ing driver mmap.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/thp: shrink_page_list() avoid splitting VM_LOCKED THP
Hugh Dickins [Tue, 15 Feb 2022 02:42:33 +0000 (18:42 -0800)]
mm/thp: shrink_page_list() avoid splitting VM_LOCKED THP

4.8 commit 7751b2da6be0 ("vmscan: split file huge pages before paging
them out") inserted a split_huge_page_to_list() into shrink_page_list()
without considering the mlock case: no problem if the page has already
been marked as Mlocked (the !page_evictable check much higher up will
have skipped all this), but it has always been the case that races or
omissions in setting Mlocked can rely on page reclaim to detect this
and correct it before actually reclaiming - and that remains so, but
what a shame if a hugepage is needlessly split before discovering it.

It is surprising that page_check_references() returns PAGEREF_RECLAIM
when VM_LOCKED, but there was a good reason for that: try_to_unmap_one()
is where the condition is detected and corrected; and until now it could
not be done in page_referenced_one(), because that does not always have
the page locked.  Now that mlock's requirement for page lock has gone,
copy try_to_unmap_one()'s mlock restoration into page_referenced_one(),
and let page_check_references() return PAGEREF_ACTIVATE in this case.

But page_referenced_one() may find a pte mapping one part of a hugepage:
what hold should a pte mapped in a VM_LOCKED area exert over the entire
huge page?  That's debatable.  The approach taken here is to treat that
pte mapping in page_referenced_one() as if not VM_LOCKED, and if no
VM_LOCKED pmd mapping is found later in the walk, and lack of reference
permits, then PAGEREF_RECLAIM take it to attempted splitting as before.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/thp: collapse_file() do try_to_unmap(TTU_BATCH_FLUSH)
Hugh Dickins [Tue, 15 Feb 2022 02:40:55 +0000 (18:40 -0800)]
mm/thp: collapse_file() do try_to_unmap(TTU_BATCH_FLUSH)

collapse_file() is using unmap_mapping_pages(1) on each small page found
mapped, unlike others (reclaim, migration, splitting, memory-failure) who
use try_to_unmap().  There are four advantages to try_to_unmap(): first,
its TTU_IGNORE_MLOCK option now avoids leaving mlocked page in pagevec;
second, its vma lookup uses i_mmap_lock_read() not i_mmap_lock_write();
third, it breaks out early if page is not mapped everywhere it might be;
fourth, its TTU_BATCH_FLUSH option can be used, as in page reclaim, to
save up all the TLB flushing until all of the pages have been unmapped.

Wild guess: perhaps it was originally written to use try_to_unmap(),
but hit the VM_BUG_ON_PAGE(page_mapped) after unmapping, because without
TTU_SYNC it may skip page table locks; but unmap_mapping_pages() never
skips them, so fixed the issue.  I did once hit that VM_BUG_ON_PAGE()
since making this change: we could pass TTU_SYNC here, but I think just
delete the check - the race is very rare, this is an ordinary small page
so we don't need to be so paranoid about mapcount surprises, and the
page_ref_freeze() just below already handles the case adequately.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/munlock: page migration needs mlock pagevec drained
Hugh Dickins [Tue, 15 Feb 2022 02:38:47 +0000 (18:38 -0800)]
mm/munlock: page migration needs mlock pagevec drained

Page migration of a VM_LOCKED page tends to fail, because when the old
page is unmapped, it is put on the mlock pagevec with raised refcount,
which then fails the freeze.

At first I thought this would be fixed by a local mlock_page_drain() at
the upper rmap_walk() level - which would have nicely batched all the
munlocks of that page; but tests show that the task can too easily move
to another cpu, leaving pagevec residue behind which fails the migration.

So try_to_migrate_one() drain the local pagevec after page_remove_rmap()
from a VM_LOCKED vma; and do the same in try_to_unmap_one(), whose
TTU_IGNORE_MLOCK users would want the same treatment; and do the same
in remove_migration_pte() - not important when successfully inserting
a new page, but necessary when hoping to retry after failure.

Any new pagevec runs the risk of adding a new way of stranding, and we
might discover other corners where mlock_page_drain() or lru_add_drain()
would now help.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/munlock: mlock_page() munlock_page() batch by pagevec
Hugh Dickins [Tue, 15 Feb 2022 02:37:29 +0000 (18:37 -0800)]
mm/munlock: mlock_page() munlock_page() batch by pagevec

A weakness of the page->mlock_count approach is the need for lruvec lock
while holding page table lock.  That is not an overhead we would allow on
normal pages, but I think acceptable just for pages in an mlocked area.
But let's try to amortize the extra cost by gathering on per-cpu pagevec
before acquiring the lruvec lock.

I have an unverified conjecture that the mlock pagevec might work out
well for delaying the mlock processing of new file pages until they have
got off lru_cache_add()'s pagevec and on to LRU.

The initialization of page->mlock_count is subject to races and awkward:
0 or !!PageMlocked or 1?  Was it wrong even in the implementation before
this commit, which just widens the window?  I haven't gone back to think
it through.  Maybe someone can point out a better way to initialize it.

Bringing lru_cache_add_inactive_or_unevictable()'s mlock initialization
into mm/mlock.c has helped: mlock_new_page(), using the mlock pagevec,
rather than lru_cache_add()'s pagevec.

Experimented with various orderings: the right thing seems to be for
mlock_page() and mlock_new_page() to TestSetPageMlocked before adding to
pagevec, but munlock_page() to leave TestClearPageMlocked to the later
pagevec processing.

Dropped the VM_BUG_ON_PAGE(PageTail)s this time around: they have made
their point, and the thp_nr_page()s already contain a VM_BUG_ON_PGFLAGS()
for that.

This still leaves acquiring lruvec locks under page table lock each time
the pagevec fills (or a THP is added): which I suppose is rather silly,
since they sit on pagevec waiting to be processed long after page table
lock has been dropped; but I'm disinclined to uglify the calling sequence
until some load shows an actual problem with it (nothing wrong with
taking lruvec lock under page table lock, just "nicer" to do it less).

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/munlock: delete smp_mb() from __pagevec_lru_add_fn()
Hugh Dickins [Tue, 15 Feb 2022 02:34:46 +0000 (18:34 -0800)]
mm/munlock: delete smp_mb() from __pagevec_lru_add_fn()

My reading of comment on smp_mb__after_atomic() in __pagevec_lru_add_fn()
says that it can now be deleted; and that remains so when the next patch
is added.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/migrate: __unmap_and_move() push good newpage to LRU
Hugh Dickins [Tue, 15 Feb 2022 02:33:17 +0000 (18:33 -0800)]
mm/migrate: __unmap_and_move() push good newpage to LRU

Compaction, NUMA page movement, THP collapse/split, and memory failure
do isolate unevictable pages from their "LRU", losing the record of
mlock_count in doing so (isolators are likely to use page->lru for their
own private lists, so mlock_count has to be presumed lost).

That's unfortunate, and we should put in some work to correct that: one
can imagine a function to build up the mlock_count again - but it would
require i_mmap_rwsem for read, so be careful where it's called.  Or
page_referenced_one() and try_to_unmap_one() might do that extra work.

But one place that can very easily be improved is page migration's
__unmap_and_move(): a small adjustment to where the successful new page
is put back on LRU, and its mlock_count (if any) is built back up by
remove_migration_ptes().

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/munlock: mlock_pte_range() when mlocking or munlocking
Hugh Dickins [Tue, 15 Feb 2022 02:31:48 +0000 (18:31 -0800)]
mm/munlock: mlock_pte_range() when mlocking or munlocking

Fill in missing pieces: reimplementation of munlock_vma_pages_range(),
required to lower the mlock_counts when munlocking without munmapping;
and its complement, implementation of mlock_vma_pages_range(), required
to raise the mlock_counts on pages already there when a range is mlocked.

Combine them into just the one function mlock_vma_pages_range(), using
walk_page_range() to run mlock_pte_range().  This approach fixes the
"Very slow unlockall()" of unpopulated PROT_NONE areas, reported in
https://lore.kernel.org/linux-mm/70885d37-62b7-748b-29df-9e94f3291736@gmail.com/

Munlock clears VM_LOCKED at the start, under exclusive mmap_lock; but if
a racing truncate or holepunch (depending on i_mmap_rwsem) gets to the
pte first, it will not try to munlock the page: leaving release_pages()
to correct it when the last reference to the page is gone - that's okay,
a page is not evictable anyway while it is held by an extra reference.

Mlock sets VM_LOCKED at the start, under exclusive mmap_lock; but if
a racing remove_migration_pte() or try_to_unmap_one() (depending on
i_mmap_rwsem) gets to the pte first, it will try to mlock the page,
then mlock_pte_range() mlock it a second time.  This is harder to
reproduce, but a more serious race because it could leave the page
unevictable indefinitely though the area is munlocked afterwards.
Guard against it by setting the (inappropriate) VM_IO flag,
and modifying mlock_vma_page() to decline such vmas.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/munlock: maintain page->mlock_count while unevictable
Hugh Dickins [Tue, 15 Feb 2022 02:29:54 +0000 (18:29 -0800)]
mm/munlock: maintain page->mlock_count while unevictable

Previous patches have been preparatory: now implement page->mlock_count.
The ordering of the "Unevictable LRU" is of no significance, and there is
no point holding unevictable pages on a list: place page->mlock_count to
overlay page->lru.prev (since page->lru.next is overlaid by compound_head,
which needs to be even so as not to satisfy PageTail - though 2 could be
added instead of 1 for each mlock, if that's ever an improvement).

But it's only safe to rely on or modify page->mlock_count while lruvec
lock is held and page is on unevictable "LRU" - we can save lots of edits
by continuing to pretend that there's an imaginary LRU here (there is an
unevictable count which still needs to be maintained, but not a list).

The mlock_count technique suffers from an unreliability much like with
page_mlock(): while someone else has the page off LRU, not much can
be done.  As before, err on the safe side (behave as if mlock_count 0),
and let try_to_unlock_one() move the page to unevictable if reclaim finds
out later on - a few misplaced pages don't matter, what we want to avoid
is imbalancing reclaim by flooding evictable lists with unevictable pages.

I am not a fan of "if (!isolate_lru_page(page)) putback_lru_page(page);":
if we have taken lruvec lock to get the page off its present list, then
we save everyone trouble (and however many extra atomic ops) by putting
it on its destination list immediately.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/munlock: replace clear_page_mlock() by final clearance
Hugh Dickins [Tue, 15 Feb 2022 02:28:05 +0000 (18:28 -0800)]
mm/munlock: replace clear_page_mlock() by final clearance

Placing munlock_vma_page() at the end of page_remove_rmap() shifts most
of the munlocking to clear_page_mlock(), since PageMlocked is typically
still set when mapcount has fallen to 0.  That is not what we want: we
want /proc/vmstat's unevictable_pgs_cleared to remain as a useful check
on the integrity of of the mlock/munlock protocol - small numbers are
not surprising, but big numbers mean the protocol is not working.

That could be easily fixed by placing munlock_vma_page() at the start of
page_remove_rmap(); but later in the series we shall want to batch the
munlocking, and that too would tend to leave PageMlocked still set at
the point when it is checked.

So delete clear_page_mlock() now: leave it instead to release_pages()
(and __page_cache_release()) to do this backstop clearing of Mlocked,
when page refcount has fallen to 0.  If a pinned page occasionally gets
counted as Mlocked and Unevictable until it is unpinned, that's okay.

A slightly regrettable side-effect of this change is that, since
release_pages() and __page_cache_release() may be called at interrupt
time, those places which update NR_MLOCK with interrupts enabled
had better use mod_zone_page_state() than __mod_zone_page_state()
(but holding the lruvec lock always has interrupts disabled).

This change, forcing Mlocked off when refcount 0 instead of earlier
when mapcount 0, is not fundamental: it can be reversed if performance
or something else is found to suffer; but this is the easiest way to
separate the stats - let's not complicate that without good reason.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/munlock: rmap call mlock_vma_page() munlock_vma_page()
Hugh Dickins [Tue, 15 Feb 2022 02:26:39 +0000 (18:26 -0800)]
mm/munlock: rmap call mlock_vma_page() munlock_vma_page()

Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
inline functions which check (vma->vm_flags & VM_LOCKED) before calling
mlock_page() and munlock_page() in mm/mlock.c.

Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
because we have understandable difficulty in accounting pte maps of THPs,
and if passed a PageHead page, mlock_page() and munlock_page() cannot
tell whether it's a pmd map to be counted or a pte map to be ignored.

Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
others, and use that to call mlock_vma_page() at the end of the page
adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
beginning? unimportant, but end was easier for assertions in testing).

No page lock is required (although almost all adds happen to hold it):
delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
Certainly page lock did serialize with page migration, but I'm having
difficulty explaining why that was ever important.

Mlock accounting on THPs has been hard to define, differed between anon
and file, involved PageDoubleMap in some places and not others, required
clear_page_mlock() at some points.  Keep it simple now: just count the
pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.

page_add_new_anon_rmap() callers unchanged: they have long been calling
lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
handling (it also checks for not VM_SPECIAL: I think that's overcautious,
and inconsistent with other checks, that mmap_region() already prevents
VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it).

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/munlock: delete munlock_vma_pages_all(), allow oomreap
Hugh Dickins [Tue, 15 Feb 2022 02:23:29 +0000 (18:23 -0800)]
mm/munlock: delete munlock_vma_pages_all(), allow oomreap

munlock_vma_pages_range() will still be required, when munlocking but
not munmapping a set of pages; but when unmapping a pte, the mlock count
will be maintained in much the same way as it will be maintained when
mapping in the pte.  Which removes the need for munlock_vma_pages_all()
on mlocked vmas when munmapping or exiting: eliminating the catastrophic
contention on i_mmap_rwsem, and the need for page lock on the pages.

There is still a need to update locked_vm accounting according to the
munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped().
exit_mmap() does not need locked_vm updates, so delete unlock_range().

And wasn't I the one who forbade the OOM reaper to attack mlocked vmas,
because of the uncertainty in blocking on all those page locks?
No fear of that now, so permit the OOM reaper on mlocked vmas.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/munlock: delete FOLL_MLOCK and FOLL_POPULATE
Hugh Dickins [Tue, 15 Feb 2022 02:21:52 +0000 (18:21 -0800)]
mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE

If counting page mlocks, we must not double-count: follow_page_pte() can
tell if a page has already been Mlocked or not, but cannot tell if a pte
has already been counted or not: that will have to be done when the pte
is mapped in (which lru_cache_add_inactive_or_unevictable() already tracks
for new anon pages, but there's no such tracking yet for others).

Delete all the FOLL_MLOCK code - faulting in the missing pages will do
all that is necessary, without special mlock_vma_page() calls from here.

But then FOLL_POPULATE turns out to serve no purpose - it was there so
that its absence would tell faultin_page() not to faultin page when
setting up VM_LOCKONFAULT areas; but if there's no special work needed
here for mlock, then there's no work at all here for VM_LOCKONFAULT.

Have I got that right?  I've not looked into the history, but see that
FOLL_POPULATE goes back before VM_LOCKONFAULT: did it serve a different
purpose before?  Ah, yes, it was used to skip the old stack guard page.

And is it intentional that COW is not broken on existing pages when
setting up a VM_LOCKONFAULT area?  I can see that being argued either
way, and have no reason to disagree with current behaviour.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/munlock: delete page_mlock() and all its works
Hugh Dickins [Tue, 15 Feb 2022 02:20:24 +0000 (18:20 -0800)]
mm/munlock: delete page_mlock() and all its works

We have recommended some applications to mlock their userspace, but that
turns out to be counter-productive: when many processes mlock the same
file, contention on rmap's i_mmap_rwsem can become intolerable at exit: it
is needed for write, to remove any vma mapping that file from rmap's tree;
but hogged for read by those with mlocks calling page_mlock() (formerly
known as try_to_munlock()) on *each* page mapped from the file (the
purpose being to find out whether another process has the page mlocked,
so therefore it should not be unmlocked yet).

Several optimizations have been made in the past: one is to skip
page_mlock() when mapcount tells that nothing else has this page
mapped; but that doesn't help at all when others do have it mapped.
This time around, I initially intended to add a preliminary search
of the rmap tree for overlapping VM_LOCKED ranges; but that gets
messy with locking order, when in doubt whether a page is actually
present; and risks adding even more contention on the i_mmap_rwsem.

A solution would be much easier, if only there were space in struct page
for an mlock_count... but actually, most of the time, there is space for
it - an mlocked page spends most of its life on an unevictable LRU, but
since 3.18 removed the scan_unevictable_pages sysctl, that "LRU" has
been redundant.  Let's try to reuse its page->lru.

But leave that until a later patch: in this patch, clear the ground by
removing page_mlock(), and all the infrastructure that has gathered
around it - which mostly hinders understanding, and will make reviewing
new additions harder.  Don't mind those old comments about THPs, they
date from before 4.5's refcounting rework: splitting is not a risk here.

Just keep a minimal version of munlock_vma_page(), as reminder of what it
should attend to (in particular, the odd way PGSTRANDED is counted out of
PGMUNLOCKED), and likewise a stub for munlock_vma_pages_range().  Move
unchanged __mlock_posix_error_return() out of the way, down to above its
caller: this series then makes no further change after mlock_fixup().

After this and each following commit, the kernel builds, boots and runs;
but with deficiencies which may show up in testing of mlock and munlock.
The system calls succeed or fail as before, and mlock remains effective
in preventing page reclaim; but meminfo's Unevictable and Mlocked amounts
may be shown too low after mlock, grow, then stay too high after munlock:
with previously mlocked pages remaining unevictable for too long, until
finally unmapped and freed and counts corrected. Normal service will be
resumed in "mm/munlock: mlock_pte_range() when mlocking or munlocking".

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agoMerge tag 'mmc-v5.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Linus Torvalds [Wed, 16 Feb 2022 20:09:22 +0000 (12:09 -0800)]
Merge tag 'mmc-v5.17-rc1-2' of git://git./linux/kernel/git/ulfh/mmc

Pull MMC fix from Ulf Hansson:
 "Fix recovery logic for multi block I/O reads (MMC_READ_MULTIPLE_BLOCK)"

* tag 'mmc-v5.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: block: fix read single on recovery logic

2 years agotty: n_tty: do not look ahead for EOL character past the end of the buffer
Linus Torvalds [Tue, 15 Feb 2022 23:28:00 +0000 (15:28 -0800)]
tty: n_tty: do not look ahead for EOL character past the end of the buffer

Daniel Gibson reports that the n_tty code gets line termination wrong in
very specific cases:

 "If you feed a line with exactly 64 chars + terminating newline, and
  directly afterwards (without reading) another line into a pseudo
  terminal, the the first read() on the other side will return the 64
  char line *without* terminating newline, and the next read() will
  return the missing terminating newline AND the complete next line (if
  it fits in the buffer)"

and bisected the behavior to commit 3b830a9c34d5 ("tty: convert
tty_ldisc_ops 'read()' function to take a kernel pointer").

Now, digging deeper, it turns out that the behavior isn't exactly new:
what changed in commit 3b830a9c34d5 was that the tty line discipline
.read() function is now passed an intermediate kernel buffer rather than
the final user space buffer.

And that intermediate kernel buffer is 64 bytes in size - thus that
special case with exactly 64 bytes plus terminating newline.

The same problem did exist before, but historically the boundary was not
the 64-byte chunk, but the user-supplied buffer size, which is obviously
generally bigger (and potentially bigger than N_TTY_BUF_SIZE, which
would hide the issue entirely).

The reason is that the n_tty canon_copy_from_read_buf() code would look
ahead for the EOL character one byte further than it would actually
copy.  It would then decide that it had found the terminator, and unmark
it as an EOL character - which in turn explains why the next read
wouldn't then be terminated by it.

Now, the reason it did all this in the first place is related to some
historical and pretty obscure EOF behavior, see commit ac8f3bf8832a
("n_tty: Fix poll() after buffer-limited eof push read") and commit
40d5e0905a03 ("n_tty: Fix EOF push handling").

And the reason for the EOL confusion is that we treat EOF as a special
EOL condition, with the EOL character being NUL (aka "__DISABLED_CHAR"
in the kernel sources).

So that EOF look-ahead also affects the normal EOL handling.

This patch just removes the look-ahead that causes problems, because EOL
is much more critical than the historical "EOF in the middle of a line
that coincides with the end of the buffer" handling ever was.

Now, it is possible that we should indeed re-introduce the "look at next
character to see if it's a EOF" behavior, but if so, that should be done
not at the kernel buffer chunk boundary in canon_copy_from_read_buf(),
but at a higher level, when we run out of the user buffer.

In particular, the place to do that would be at the top of
'n_tty_read()', where we check if it's a continuation of a previously
started read, and there is no more buffer space left, we could decide to
just eat the __DISABLED_CHAR at that point.

But that would be a separate patch, because I suspect nobody actually
cares, and I'd like to get a report about it before bothering.

Fixes: 3b830a9c34d5 ("tty: convert tty_ldisc_ops 'read()' function to take a kernel pointer")
Fixes: ac8f3bf8832a ("n_tty: Fix  poll() after buffer-limited eof push read")
Fixes: 40d5e0905a03 ("n_tty: Fix EOF push handling")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215611
Reported-and-tested-by: Daniel Gibson <metalcaedes@gmail.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Tue, 15 Feb 2022 19:07:59 +0000 (11:07 -0800)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "ARM:

   - Read HW interrupt pending state from the HW

  x86:

   - Don't truncate the performance event mask on AMD

   - Fix Xen runstate updates to be atomic when preempting vCPU

   - Fix for AMD AVIC interrupt injection race

   - Several other AMD fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86/pmu: Use AMD64_RAW_EVENT_MASK for PERF_TYPE_RAW
  KVM: x86/pmu: Don't truncate the PerfEvtSeln MSR when creating a perf event
  KVM: SVM: fix race between interrupt delivery and AVIC inhibition
  KVM: SVM: set IRR in svm_deliver_interrupt
  KVM: SVM: extract avic_ring_doorbell
  selftests: kvm: Remove absent target file
  KVM: arm64: vgic: Read HW interrupt pending state from the HW
  KVM: x86/xen: Fix runstate updates to be atomic when preempting vCPU
  KVM: x86: SVM: move avic definitions from AMD's spec to svm.h
  KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it
  KVM: x86: nSVM: deal with L1 hypervisor that intercepts interrupts but lets L2 control them
  KVM: x86: nSVM: expose clean bit support to the guest
  KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM
  KVM: x86: nSVM: mark vmcb01 as dirty when restoring SMM saved state
  KVM: x86: nSVM: fix potential NULL derefernce on nested migration
  KVM: x86: SVM: don't passthrough SMAP/SMEP/PKE bits in !NPT && !gCR0.PG case
  Revert "svm: Add warning message for AVIC IPI invalid target"

2 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid
Linus Torvalds [Tue, 15 Feb 2022 18:52:05 +0000 (10:52 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/hid/hid

Pull HID fixes from Jiri Kosina:

 - memory leak fix for hid-elo driver (Dongliang Mu)

 - fix for hangs on newer AMD platforms with amd_sfh-driven hardware
   (Basavaraj Natikar )

 - locking fix in i2c-hid (Daniel Thompson)

 - a few device-ID specific quirks

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
  HID: amd_sfh: Add interrupt handler to process interrupts
  HID: amd_sfh: Add functionality to clear interrupts
  HID: amd_sfh: Disable the interrupt for all command
  HID: amd_sfh: Correct the structure field name
  HID: amd_sfh: Handle amd_sfh work buffer in PM ops
  HID:Add support for UGTABLET WP5540
  HID: amd_sfh: Add illuminance mask to limit ALS max value
  HID: amd_sfh: Increase sensor command timeout
  HID: i2c-hid: goodix: Fix a lockdep splat
  HID: elo: fix memory leak in elo_probe
  HID: apple: Set the tilde quirk flag on the Wellspring 5 and later

2 years agoMerge tag 'for-5.17-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave...
Linus Torvalds [Tue, 15 Feb 2022 17:14:05 +0000 (09:14 -0800)]
Merge tag 'for-5.17-rc4-tag' of git://git./linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - yield CPU more often when defragmenting a large file

 - skip defragmenting extents already under writeback

 - improve error message when send fails to write file data

 - get rid of warning when mounted with 'flushoncommit'

* tag 'for-5.17-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: send: in case of IO error log it
  btrfs: get rid of warning on transaction commit when using flushoncommit
  btrfs: defrag: don't try to defrag extents which are under writeback
  btrfs: don't hold CPU for too long when defragging a file

2 years agoMerge tag 'for-5.17/parisc-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller...
Linus Torvalds [Tue, 15 Feb 2022 17:10:09 +0000 (09:10 -0800)]
Merge tag 'for-5.17/parisc-3' of git://git./linux/kernel/git/deller/parisc-linux

Pull parisc architecture fixes from Helge Deller:

 - Fix miscompilations when function calls are made from inside a
   put_user() call

 - Drop __init from map_pages() declaration to avoid random boot crashes

 - Added #error messages if a 64-bit compiler was used to build a 32-bit
   kernel (and vice versa)

 - Fix out-of-bound data TLB miss faults in sba_iommu and ccio-dma
   drivers

 - Add ioread64_lo_hi() and iowrite64_lo_hi() functions to avoid kernel
   test robot errors

 - Fix link failure when 8250_gsc driver is built without CONFIG_IOSAPIC

* tag 'for-5.17/parisc-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  serial: parisc: GSC: fix build when IOSAPIC is not set
  parisc: Fix some apparent put_user() failures
  parisc: Show error if wrong 32/64-bit compiler is being used
  parisc: Add ioread64_lo_hi() and iowrite64_lo_hi()
  parisc: Fix sglist access in ccio-dma.c
  parisc: Fix data TLB miss in sba_unmap_sg
  parisc: Drop __init from map_pages declaration

2 years agoMerge tag 'hyperv-fixes-signed-20220215' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Tue, 15 Feb 2022 17:05:01 +0000 (09:05 -0800)]
Merge tag 'hyperv-fixes-signed-20220215' of git://git./linux/kernel/git/hyperv/linux

Pull hyperv fixes from Wei Liu:

 - Rework use of DMA_BIT_MASK in vmbus to work around a clang bug
   (Michael Kelley)

 - Fix NUMA topology (Long Li)

 - Fix a memory leak in vmbus (Miaoqian Lin)

 - One minor clean-up patch (Cai Huoqing)

* tag 'hyperv-fixes-signed-20220215' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  Drivers: hv: utils: Make use of the helper macro LIST_HEAD()
  Drivers: hv: vmbus: Rework use of DMA_BIT_MASK(64)
  Drivers: hv: vmbus: Fix memory leak in vmbus_add_channel_kobj
  PCI: hv: Fix NUMA node assignment when kernel boots with custom NUMA topology

2 years agoserial: parisc: GSC: fix build when IOSAPIC is not set
Randy Dunlap [Mon, 14 Feb 2022 18:00:19 +0000 (10:00 -0800)]
serial: parisc: GSC: fix build when IOSAPIC is not set

There is a build error when using a kernel .config file from
'kernel test robot' for a different build problem:

hppa64-linux-ld: drivers/tty/serial/8250/8250_gsc.o: in function `.LC3':
(.data.rel.ro+0x18): undefined reference to `iosapic_serial_irq'

when:
  CONFIG_GSC=y
  CONFIG_SERIO_GSCPS2=y
  CONFIG_SERIAL_8250_GSC=y
  CONFIG_PCI is not set
    and hence PCI_LBA is not set.
  IOSAPIC depends on PCI_LBA, so IOSAPIC is not set/enabled.

Make the use of iosapic_serial_irq() conditional to fix the build error.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kernel test robot <lkp@intel.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: linux-parisc@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: linux-serial@vger.kernel.org
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johan Hovold <johan@kernel.org>
Suggested-by: Helge Deller <deller@gmx.de>
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: stable@vger.kernel.org
Signed-off-by: Helge Deller <deller@gmx.de>
2 years agoMerge tag 'regulator-fix-v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Mon, 14 Feb 2022 17:51:26 +0000 (09:51 -0800)]
Merge tag 'regulator-fix-v5.17-rc4' of git://git./linux/kernel/git/broonie/regulator

Pull regulator fix from Mark Brown:
 "One fix here, for initialisation of regulators that don't have an
  in_enabled() operation which would mainly impact cases where they
  aren't otherwise used during early setup for some reason"

* tag 'regulator-fix-v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
  regulator: core: fix false positive in regulator_late_cleanup()

2 years agoHID: amd_sfh: Add interrupt handler to process interrupts
Basavaraj Natikar [Tue, 8 Feb 2022 12:21:12 +0000 (17:51 +0530)]
HID: amd_sfh: Add interrupt handler to process interrupts

On newer AMD platforms with SFH, it is observed that random interrupts
get generated on the SFH hardware and until this is cleared the firmware
sensor processing is stalled, resulting in no data been received to
driver side.

Add routines to handle these interrupts, so that firmware operations are
not stalled.

Signed-off-by: Basavaraj Natikar <Basavaraj.Natikar@amd.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2 years agoHID: amd_sfh: Add functionality to clear interrupts
Basavaraj Natikar [Tue, 8 Feb 2022 12:21:11 +0000 (17:51 +0530)]
HID: amd_sfh: Add functionality to clear interrupts

Newer AMD platforms with SFH may generate interrupts on some events
which are unwarranted. Until this is cleared the actual MP2 data
processing maybe stalled in some cases.

Add a mechanism to clear the pending interrupts (if any) during the
driver initialization and sensor command operations.

Signed-off-by: Basavaraj Natikar <Basavaraj.Natikar@amd.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2 years agoHID: amd_sfh: Disable the interrupt for all command
Basavaraj Natikar [Tue, 8 Feb 2022 12:21:10 +0000 (17:51 +0530)]
HID: amd_sfh: Disable the interrupt for all command

Sensor data is processed in polling mode. Hence disable the interrupt
for all sensor command.

Signed-off-by: Basavaraj Natikar <Basavaraj.Natikar@amd.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2 years agoHID: amd_sfh: Correct the structure field name
Basavaraj Natikar [Tue, 8 Feb 2022 12:21:09 +0000 (17:51 +0530)]
HID: amd_sfh: Correct the structure field name

Misinterpreted intr_enable field name. Hence correct the structure
field name accordingly to reflect the functionality.

Fixes: f264481ad614 ("HID: amd_sfh: Extend driver capabilities for multi-generation support")
Signed-off-by: Basavaraj Natikar <Basavaraj.Natikar@amd.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2 years agoHID: amd_sfh: Handle amd_sfh work buffer in PM ops
Basavaraj Natikar [Tue, 8 Feb 2022 12:21:08 +0000 (17:51 +0530)]
HID: amd_sfh: Handle amd_sfh work buffer in PM ops

Since in the current amd_sfh design the sensor data is periodically
obtained in the form of poll data, during the suspend/resume cycle,
scheduling a delayed work adds no value.

So, cancel the work and restart back during the suspend/resume cycle
respectively.

Signed-off-by: Basavaraj Natikar <Basavaraj.Natikar@amd.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2 years agoKVM: x86/pmu: Use AMD64_RAW_EVENT_MASK for PERF_TYPE_RAW
Jim Mattson [Thu, 3 Feb 2022 01:48:13 +0000 (17:48 -0800)]
KVM: x86/pmu: Use AMD64_RAW_EVENT_MASK for PERF_TYPE_RAW

AMD's event select is 3 nybbles, with the high nybble in bits 35:32 of
a PerfEvtSeln MSR. Don't mask off the high nybble when configuring a
RAW perf event.

Fixes: ca724305a2b0 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20220203014813.2130559-2-jmattson@google.com>
Reviewed-by: David Dunn <daviddunn@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: x86/pmu: Don't truncate the PerfEvtSeln MSR when creating a perf event
Jim Mattson [Thu, 3 Feb 2022 01:48:12 +0000 (17:48 -0800)]
KVM: x86/pmu: Don't truncate the PerfEvtSeln MSR when creating a perf event

AMD's event select is 3 nybbles, with the high nybble in bits 35:32 of
a PerfEvtSeln MSR. Don't drop the high nybble when setting up the
config field of a perf_event_attr structure for a call to
perf_event_create_kernel_counter().

Fixes: ca724305a2b0 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20220203014813.2130559-1-jmattson@google.com>
Reviewed-by: David Dunn <daviddunn@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoparisc: Fix some apparent put_user() failures
Helge Deller [Sun, 13 Feb 2022 21:52:11 +0000 (22:52 +0100)]
parisc: Fix some apparent put_user() failures

After commit 4b9d2a731c3d ("parisc: Switch user access functions
to signal errors in r29 instead of r8") bash suddenly started
to report those warnings after login:

-bash: cannot set terminal process group (-1): Bad file descriptor
-bash: no job control in this shell

It turned out, that a function call inside a put_user(), e.g.:
put_user(vt_do_kdgkbmode(console), (int __user *)arg);
clobbered the error register (r29) and thus the put_user() call itself
seem to have failed.

Rearrange the C-code to pre-calculate the intermediate value
and then do the put_user().
Additionally prefer the "+" constraint on pu_err and gu_err registers
to tell the compiler that those operands are both read and written by
the assembly instruction.

Reported-by: John David Anglin <dave.anglin@bell.net>
Signed-off-by: Helge Deller <deller@gmx.de>
Fixes: 4b9d2a731c3d ("parisc: Switch user access functions to signal errors in r29 instead of r8")
Signed-off-by: Helge Deller <deller@gmx.de>
2 years agoparisc: Show error if wrong 32/64-bit compiler is being used
Helge Deller [Sun, 13 Feb 2022 21:29:25 +0000 (22:29 +0100)]
parisc: Show error if wrong 32/64-bit compiler is being used

It happens quite often that people use the wrong compiler to build the
kernel:

make ARCH=parisc   -> builds the 32-bit kernel
make ARCH=parisc64 -> builds the 64-bit kernel

This patch adds a sanity check which errors out with an instruction how
use the correct ARCH= option.

Signed-off-by: Helge Deller <deller@gmx.de>
Cc: stable@vger.kernel.org # v5.15+
2 years agoLinux 5.17-rc4
Linus Torvalds [Sun, 13 Feb 2022 20:13:30 +0000 (12:13 -0800)]
Linux 5.17-rc4

2 years agoMerge tag 'kbuild-fixes-v5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 13 Feb 2022 19:58:11 +0000 (11:58 -0800)]
Merge tag 'kbuild-fixes-v5.17-2' of git://git./linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - Fix the truncated path issue for HAVE_GCC_PLUGINS test in Kconfig

 - Move -Wunsligned-access to W=1 builds to avoid sprinkling warnings
   for the latest Clang

 - Fix missing fclose() in Kconfig

 - Fix Kconfig to touch dep headers correctly when KCONFIG_AUTOCONFIG is
   overridden.

* tag 'kbuild-fixes-v5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kconfig: fix failing to generate auto.conf
  kconfig: fix missing fclose() on error paths
  Makefile.extrawarn: Move -Wunaligned-access to W=1
  kconfig: let 'shell' return enough output for deep path names

2 years agoMerge tag 'irq-urgent-2022-02-13' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 13 Feb 2022 18:06:40 +0000 (10:06 -0800)]
Merge tag 'irq-urgent-2022-02-13' of git://git./linux/kernel/git/tip/tip

Pull irq fixes from Thomas Gleixner:
 "Interrupt chip driver fixes:

   - Don't install an hotplug notifier for GICV3-ITS on systems which do
     not need it to prevent a warning in the notifier about inconsistent
     state

   - Add the missing device tree matching for the T-HEAD PLIC variant so
     the related SoC is properly supported"

* tag 'irq-urgent-2022-02-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/sifive-plic: Add missing thead,c900-plic match string
  dt-bindings: update riscv plic compatible string
  irqchip/gic-v3-its: Skip HP notifier when no ITS is registered

2 years agoMerge tag 'objtool_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 13 Feb 2022 17:43:34 +0000 (09:43 -0800)]
Merge tag 'objtool_urgent_for_v5.17_rc4' of git://git./linux/kernel/git/tip/tip

Pull objtool fix from Borislav Petkov:
 "Fix a case where objtool would mistakenly warn about instructions
  being unreachable"

* tag 'objtool_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/bug: Merge annotate_reachable() into _BUG_FLAGS() asm

2 years agoMerge tag 'sched_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 13 Feb 2022 17:27:26 +0000 (09:27 -0800)]
Merge tag 'sched_urgent_for_v5.17_rc4' of git://git./linux/kernel/git/tip/tip

Pull scheduler fix from Borislav Petkov:
 "Fix a NULL-ptr dereference when recalculating a sched entity's weight"

* tag 'sched_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix fault in reweight_entity

2 years agoMerge tag 'perf_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 13 Feb 2022 17:25:26 +0000 (09:25 -0800)]
Merge tag 'perf_urgent_for_v5.17_rc4' of git://git./linux/kernel/git/tip/tip

Pull perf fix from Borislav Petkov:
 "Prevent cgroup event list corruption when switching events"

* tag 'perf_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix list corruption in perf_cgroup_switch()

2 years agoMerge tag 'x86_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 13 Feb 2022 17:22:52 +0000 (09:22 -0800)]
Merge tag 'x86_urgent_for_v5.17_rc4' of git://git./linux/kernel/git/tip/tip

Pull x86 fix from Borislav Petkov:
 "Prevent softlockups when tearing down large SGX enclaves"

* tag 'x86_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sgx: Silence softlockup detection when releasing large enclaves

2 years agoMerge tag '5.17-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Sun, 13 Feb 2022 17:16:45 +0000 (09:16 -0800)]
Merge tag '5.17-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Three small smb3 reconnect fixes and an error log clarification"

* tag '5.17-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: mark sessions for reconnection in helper function
  cifs: call helper functions for marking channels for reconnect
  cifs: call cifs_reconnect when a connection is marked
  [smb3] improve error message when mount options conflict with posix

2 years agoMerge tag 'irqchip-fixes-5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Thomas Gleixner [Sun, 13 Feb 2022 13:16:23 +0000 (14:16 +0100)]
Merge tag 'irqchip-fixes-5.17-2' of git://git./linux/kernel/git/maz/arm-platforms into irq/urgent

Pull irqchip fixes from Marc Zyngier:

 - Don't register a hotplug notifier on GICv3 systems that advertise
   LPI support, but have no ITS to make use of it

 - Add missing DT matching for the thead,c900-plic variant of the
   SiFive PLIC

Link: https://lore.kernel.org/r/20220211110038.1179155-1-maz@kernel.org
2 years agoMerge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Linus Torvalds [Sat, 12 Feb 2022 18:29:02 +0000 (10:29 -0800)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "Two minor fixes in the lpfc driver. One changing the classification of
  trace messages and the other fixing a build issue when NVME_FC is
  disabled"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: lpfc: Reduce log messages seen after firmware download
  scsi: lpfc: Remove NVMe support if kernel has NVME_FC disabled

2 years agoMerge tag 'char-misc-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregk...
Linus Torvalds [Sat, 12 Feb 2022 18:16:32 +0000 (10:16 -0800)]
Merge tag 'char-misc-5.17-rc4' of git://git./linux/kernel/git/gregkh/char-misc

Pull char/misc driver fixes from Greg KH:
 "Here are a small number of char/misc driver fixes for 5.17-rc4 for
  reported issues. They contain:

   - phy driver fixes

   - iio driver fix

   - eeprom driver fix

   - speakup regression fix

   - fastrpc fix

  All of these have been in linux-next with no reported issues"

* tag 'char-misc-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  iio: buffer: Fix file related error handling in IIO_BUFFER_GET_FD_IOCTL
  speakup-dectlk: Restore pitch setting
  bus: mhi: pci_generic: Add mru_default for Cinterion MV31-W
  bus: mhi: pci_generic: Add mru_default for Foxconn SDX55
  eeprom: ee1004: limit i2c reads to I2C_SMBUS_BLOCK_MAX
  misc: fastrpc: avoid double fput() on failed usercopy
  phy: dphy: Correct clk_pre parameter
  phy: phy-mtk-tphy: Fix duplicated argument in phy-mtk-tphy
  phy: stm32: fix a refcount leak in stm32_usbphyc_pll_enable()
  phy: xilinx: zynqmp: Fix bus width setting for SGMII
  phy: cadence: Sierra: fix error handling bugs in probe()
  phy: ti: Fix missing sentinel for clk_div_table
  phy: broadcom: Kconfig: Fix PHY_BRCM_USB config option
  phy: usb: Leave some clocks running during suspend

2 years agoMerge tag 'staging-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh...
Linus Torvalds [Sat, 12 Feb 2022 18:10:35 +0000 (10:10 -0800)]
Merge tag 'staging-5.17-rc4' of git://git./linux/kernel/git/gregkh/staging

Pullstaging driver fixes from Greg KH:
 "Here are two staging driver fixes for 5.17-rc4.  These are:

   - fbtft error path fix

   - vc04_services rcu dereference fix

  Both of these have been in linux-next for a while with no reported
  issues"

* tag 'staging-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: fbtft: Fix error path in fbtft_driver_module_init()
  staging: vc04_services: Fix RCU dereference check

2 years agoMerge tag 'tty-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Linus Torvalds [Sat, 12 Feb 2022 18:01:55 +0000 (10:01 -0800)]
Merge tag 'tty-5.17-rc4' of git://git./linux/kernel/git/gregkh/tty

Pull tty/serial fixes from Greg KH:
 "Here are four small tty/serial fixes for 5.17-rc4.  They are:

   - 8250_pericom change revert to fix a reported regression

   - two speculation fixes for vt_ioctl

   - n_tty regression fix for polling

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'tty-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  vt_ioctl: add array_index_nospec to VT_ACTIVATE
  vt_ioctl: fix array_index_nospec in vt_setactivate
  serial: 8250_pericom: Revert "Re-enable higher baud rates"
  n_tty: wake up poll(POLLRDNORM) on receiving data

2 years agoMerge tag 'usb-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Linus Torvalds [Sat, 12 Feb 2022 17:56:18 +0000 (09:56 -0800)]
Merge tag 'usb-5.17-rc4' of git://git./linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
 "Here are some small USB driver fixes for 5.17-rc4 that resolve some
  reported issues and add new device ids:

   - usb-serial new device ids

   - ulpi cleanup fixes

   - f_fs use-after-free fix

   - dwc3 driver fixes

   - ax88179_178a usb network driver fix

   - usb gadget fixes

  There is a revert at the end of this series to resolve a build problem
  that 0-day found yesterday. Most of these have been in linux-next,
  except for the last few, and all have now passed 0-day tests"

* tag 'usb-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  Revert "usb: dwc2: drd: fix soft connect when gadget is unconfigured"
  usb: dwc2: drd: fix soft connect when gadget is unconfigured
  usb: gadget: rndis: check size of RNDIS_MSG_SET command
  USB: gadget: validate interface OS descriptor requests
  usb: core: Unregister device on component_add() failure
  net: usb: ax88179_178a: Fix out-of-bounds accesses in RX fixup
  usb: dwc3: gadget: Prevent core from processing stale TRBs
  USB: serial: cp210x: add CPI Bulk Coin Recycler id
  USB: serial: cp210x: add NCR Retail IO box id
  USB: serial: ftdi_sio: add support for Brainboxes US-159/235/320
  usb: gadget: f_uac2: Define specific wTerminalType
  usb: gadget: udc: renesas_usb3: Fix host to USB_ROLE_NONE transition
  usb: raw-gadget: fix handling of dual-direction-capable endpoints
  usb: usb251xb: add boost-up property support
  usb: ulpi: Call of_node_put correctly
  usb: ulpi: Move of_node_put to ulpi_dev_release
  USB: serial: option: add ZTE MF286D modem
  USB: serial: ch341: add support for GW Instek USB2.0-Serial devices
  usb: f_fs: Fix use-after-free for epfile
  usb: dwc3: xilinx: fix uninitialized return value

2 years agoMerge tag 's390-5.17-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Linus Torvalds [Sat, 12 Feb 2022 17:12:44 +0000 (09:12 -0800)]
Merge tag 's390-5.17-4' of git://git./linux/kernel/git/s390/linux

Pull s390 updates from Vasily Gorbik:
 "Maintainers and reviewers changes:

    - Add Alexander Gordeev as maintainer for s390.

    - Christian Borntraeger will focus on s390 KVM maintainership and
      stays as s390 reviewer.

  Fixes:

   - Fix clang build of modules loader KUnit test.

   - Fix kernel panic in CIO code on FCES path-event when no driver is
     attached to a device or the driver does not provide the path_event
     function"

* tag 's390-5.17-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/cio: verify the driver availability for path_event call
  s390/module: fix building test_modules_helpers.o with clang
  MAINTAINERS: downgrade myself to Reviewer for s390
  MAINTAINERS: add Alexander Gordeev as maintainer for s390

2 years agoMerge tag 'for-linus-5.17a-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 12 Feb 2022 17:08:57 +0000 (09:08 -0800)]
Merge tag 'for-linus-5.17a-rc4-tag' of git://git./linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:

 - Two small cleanups

 - Another fix for addressing the EFI framebuffer above 4GB when running
   as Xen dom0

 - A patch to let Xen guests use reserved bits in MSI- and IO-APIC-
   registers for extended APIC-IDs the same way KVM guests are doing it
   already

* tag 'for-linus-5.17a-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/pci: Make use of the helper macro LIST_HEAD()
  xen/x2apic: Fix inconsistent indenting
  xen/x86: detect support for extended destination ID
  xen/x86: obtain full video frame buffer address for Dom0 also under EFI

2 years agoMerge tag 'seccomp-v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees...
Linus Torvalds [Sat, 12 Feb 2022 17:04:05 +0000 (09:04 -0800)]
Merge tag 'seccomp-v5.17-rc4' of git://git./linux/kernel/git/kees/linux

Pull seccomp fixes from Kees Cook:
 "This fixes a corner case of fatal SIGSYS being ignored since v5.15.
  Along with the signal fix is a change to seccomp so that seeing
  another syscall after a fatal filter result will cause seccomp to kill
  the process harder.

  Summary:

   - Force HANDLER_EXIT even for SIGNAL_UNKILLABLE

   - Make seccomp self-destruct after fatal filter results

   - Update seccomp samples for easier behavioral demonstration"

* tag 'seccomp-v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  samples/seccomp: Adjust sample to also provide kill option
  seccomp: Invalidate seccomp mode to catch death failures
  signal: HANDLER_EXIT should clear SIGNAL_UNKILLABLE

2 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Sat, 12 Feb 2022 16:57:37 +0000 (08:57 -0800)]
Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
 "5 patches.

  Subsystems affected by this patch series: binfmt, procfs, and mm
  (vmscan, memcg, and kfence)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  kfence: make test case compatible with run time set sample interval
  mm: memcg: synchronize objcg lists with a dedicated spinlock
  mm: vmscan: remove deadlock due to throttling failing to make progress
  fs/proc: task_mmu.c: don't read mapcount for migration entry
  fs/binfmt_elf: fix PT_LOAD p_align values for loaders

2 years agokconfig: fix failing to generate auto.conf
Jing Leng [Fri, 11 Feb 2022 09:27:36 +0000 (17:27 +0800)]
kconfig: fix failing to generate auto.conf

When the KCONFIG_AUTOCONFIG is specified (e.g. export \
KCONFIG_AUTOCONFIG=output/config/auto.conf), the directory of
include/config/ will not be created, so kconfig can't create deps
files in it and auto.conf can't be generated.

Signed-off-by: Jing Leng <jleng@ambarella.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2 years agoRevert "usb: dwc2: drd: fix soft connect when gadget is unconfigured"
Greg Kroah-Hartman [Sat, 12 Feb 2022 09:08:54 +0000 (10:08 +0100)]
Revert "usb: dwc2: drd: fix soft connect when gadget is unconfigured"

This reverts commit 269cbcf7b72de6f0016806d4a0cec1d689b55a87.

It causes build errors as reported by the kernel test robot.

Link: https://lore.kernel.org/r/202202112236.AwoOTtHO-lkp@intel.com
Reported-by: kernel test robot <lkp@intel.com>
Fixes: 269cbcf7b72d ("usb: dwc2: drd: fix soft connect when gadget is unconfigured")
Cc: stable@kernel.org
Cc: Amelie Delaunay <amelie.delaunay@foss.st.com>
Cc: Minas Harutyunyan <Minas.Harutyunyan@synopsys.com>
Cc: Fabrice Gasnier <fabrice.gasnier@foss.st.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 years agokfence: make test case compatible with run time set sample interval
Peng Liu [Sat, 12 Feb 2022 00:32:35 +0000 (16:32 -0800)]
kfence: make test case compatible with run time set sample interval

The parameter kfence_sample_interval can be set via boot parameter and
late shell command, which is convenient for automated tests and KFENCE
parameter optimization.  However, KFENCE test case just uses
compile-time CONFIG_KFENCE_SAMPLE_INTERVAL, which will make KFENCE test
case not run as users desired.  Export kfence_sample_interval, so that
KFENCE test case can use run-time-set sample interval.

Link: https://lkml.kernel.org/r/20220207034432.185532-1-liupeng256@huawei.com
Signed-off-by: Peng Liu <liupeng256@huawei.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Christian Knig <christian.koenig@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agomm: memcg: synchronize objcg lists with a dedicated spinlock
Roman Gushchin [Sat, 12 Feb 2022 00:32:32 +0000 (16:32 -0800)]
mm: memcg: synchronize objcg lists with a dedicated spinlock

Alexander reported a circular lock dependency revealed by the mmap1 ltp
test:

  LOCKDEP_CIRCULAR (suite: ltp, case: mtest06 (mmap1))
          WARNING: possible circular locking dependency detected
          5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1 Not tainted
          ------------------------------------------------------
          mmap1/202299 is trying to acquire lock:
          00000001892c0188 (css_set_lock){..-.}-{2:2}, at: obj_cgroup_release+0x4a/0xe0
          but task is already holding lock:
          00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180
          which lock already depends on the new lock.
          the existing dependency chain (in reverse order) is:
          -> #1 (&sighand->siglock){-.-.}-{2:2}:
                 __lock_acquire+0x604/0xbd8
                 lock_acquire.part.0+0xe2/0x238
                 lock_acquire+0xb0/0x200
                 _raw_spin_lock_irqsave+0x6a/0xd8
                 __lock_task_sighand+0x90/0x190
                 cgroup_freeze_task+0x2e/0x90
                 cgroup_migrate_execute+0x11c/0x608
                 cgroup_update_dfl_csses+0x246/0x270
                 cgroup_subtree_control_write+0x238/0x518
                 kernfs_fop_write_iter+0x13e/0x1e0
                 new_sync_write+0x100/0x190
                 vfs_write+0x22c/0x2d8
                 ksys_write+0x6c/0xf8
                 __do_syscall+0x1da/0x208
                 system_call+0x82/0xb0
          -> #0 (css_set_lock){..-.}-{2:2}:
                 check_prev_add+0xe0/0xed8
                 validate_chain+0x736/0xb20
                 __lock_acquire+0x604/0xbd8
                 lock_acquire.part.0+0xe2/0x238
                 lock_acquire+0xb0/0x200
                 _raw_spin_lock_irqsave+0x6a/0xd8
                 obj_cgroup_release+0x4a/0xe0
                 percpu_ref_put_many.constprop.0+0x150/0x168
                 drain_obj_stock+0x94/0xe8
                 refill_obj_stock+0x94/0x278
                 obj_cgroup_charge+0x164/0x1d8
                 kmem_cache_alloc+0xac/0x528
                 __sigqueue_alloc+0x150/0x308
                 __send_signal+0x260/0x550
                 send_signal+0x7e/0x348
                 force_sig_info_to_task+0x104/0x180
                 force_sig_fault+0x48/0x58
                 __do_pgm_check+0x120/0x1f0
                 pgm_check_handler+0x11e/0x180
          other info that might help us debug this:
           Possible unsafe locking scenario:
                 CPU0                    CPU1
                 ----                    ----
            lock(&sighand->siglock);
                                         lock(css_set_lock);
                                         lock(&sighand->siglock);
            lock(css_set_lock);
           *** DEADLOCK ***
          2 locks held by mmap1/202299:
           #0: 00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180
           #1: 00000001892ad560 (rcu_read_lock){....}-{1:2}, at: percpu_ref_put_many.constprop.0+0x0/0x168
          stack backtrace:
          CPU: 15 PID: 202299 Comm: mmap1 Not tainted 5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1
          Hardware name: IBM 3906 M04 704 (LPAR)
          Call Trace:
            dump_stack_lvl+0x76/0x98
            check_noncircular+0x136/0x158
            check_prev_add+0xe0/0xed8
            validate_chain+0x736/0xb20
            __lock_acquire+0x604/0xbd8
            lock_acquire.part.0+0xe2/0x238
            lock_acquire+0xb0/0x200
            _raw_spin_lock_irqsave+0x6a/0xd8
            obj_cgroup_release+0x4a/0xe0
            percpu_ref_put_many.constprop.0+0x150/0x168
            drain_obj_stock+0x94/0xe8
            refill_obj_stock+0x94/0x278
            obj_cgroup_charge+0x164/0x1d8
            kmem_cache_alloc+0xac/0x528
            __sigqueue_alloc+0x150/0x308
            __send_signal+0x260/0x550
            send_signal+0x7e/0x348
            force_sig_info_to_task+0x104/0x180
            force_sig_fault+0x48/0x58
            __do_pgm_check+0x120/0x1f0
            pgm_check_handler+0x11e/0x180
          INFO: lockdep is turned off.

In this example a slab allocation from __send_signal() caused a
refilling and draining of a percpu objcg stock, resulted in a releasing
of another non-related objcg.  Objcg release path requires taking the
css_set_lock, which is used to synchronize objcg lists.

This can create a circular dependency with the sighandler lock, which is
taken with the locked css_set_lock by the freezer code (to freeze a
task).

In general it seems that using css_set_lock to synchronize objcg lists
makes any slab allocations and deallocation with the locked css_set_lock
and any intervened locks risky.

To fix the problem and make the code more robust let's stop using
css_set_lock to synchronize objcg lists and use a new dedicated spinlock
instead.

Link: https://lkml.kernel.org/r/Yfm1IHmoGdyUR81T@carbon.dhcp.thefacebook.com
Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Tested-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Jeremy Linton <jeremy.linton@arm.com>
Tested-by: Jeremy Linton <jeremy.linton@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agomm: vmscan: remove deadlock due to throttling failing to make progress
Mel Gorman [Sat, 12 Feb 2022 00:32:29 +0000 (16:32 -0800)]
mm: vmscan: remove deadlock due to throttling failing to make progress

A soft lockup bug in kcompactd was reported in a private bugzilla with
the following visible in dmesg;

  watchdog: BUG: soft lockup - CPU#33 stuck for 26s! [kcompactd0:479]
  watchdog: BUG: soft lockup - CPU#33 stuck for 52s! [kcompactd0:479]
  watchdog: BUG: soft lockup - CPU#33 stuck for 78s! [kcompactd0:479]
  watchdog: BUG: soft lockup - CPU#33 stuck for 104s! [kcompactd0:479]

The machine had 256G of RAM with no swap and an earlier failed
allocation indicated that node 0 where kcompactd was run was potentially
unreclaimable;

  Node 0 active_anon:29355112kB inactive_anon:2913528kB active_file:0kB
    inactive_file:0kB unevictable:64kB isolated(anon):0kB isolated(file):0kB
    mapped:8kB dirty:0kB writeback:0kB shmem:26780kB shmem_thp:
    0kB shmem_pmdmapped: 0kB anon_thp: 23480320kB writeback_tmp:0kB
    kernel_stack:2272kB pagetables:24500kB all_unreclaimable? yes

Vlastimil Babka investigated a crash dump and found that a task
migrating pages was trying to drain PCP lists;

  PID: 52922  TASK: ffff969f820e5000  CPU: 19  COMMAND: "kworker/u128:3"
  Call Trace:
     __schedule
     schedule
     schedule_timeout
     wait_for_completion
     __flush_work
     __drain_all_pages
     __alloc_pages_slowpath.constprop.114
     __alloc_pages
     alloc_migration_target
     migrate_pages
     migrate_to_node
     do_migrate_pages
     cpuset_migrate_mm_workfn
     process_one_work
     worker_thread
     kthread
     ret_from_fork

This failure is specific to CONFIG_PREEMPT=n builds.  The root of the
problem is that kcompact0 is not rescheduling on a CPU while a task that
has isolated a large number of the pages from the LRU is waiting on
kcompact0 to reschedule so the pages can be released.  While
shrink_inactive_list() only loops once around too_many_isolated, reclaim
can continue without rescheduling if sc->skipped_deactivate == 1 which
could happen if there was no file LRU and the inactive anon list was not
low.

Link: https://lkml.kernel.org/r/20220203100326.GD3301@suse.de
Fixes: d818fca1cac3 ("mm/vmscan: throttle reclaim and compaction when too may pages are isolated")
Signed-off-by: Mel Gorman <mgorman@suse.de>
Debugged-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agofs/proc: task_mmu.c: don't read mapcount for migration entry
Yang Shi [Sat, 12 Feb 2022 00:32:26 +0000 (16:32 -0800)]
fs/proc: task_mmu.c: don't read mapcount for migration entry

The syzbot reported the below BUG:

  kernel BUG at include/linux/page-flags.h:785!
  invalid opcode: 0000 [#1] PREEMPT SMP KASAN
  CPU: 1 PID: 4392 Comm: syz-executor560 Not tainted 5.16.0-rc6-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  RIP: 0010:PageDoubleMap include/linux/page-flags.h:785 [inline]
  RIP: 0010:__page_mapcount+0x2d2/0x350 mm/util.c:744
  Call Trace:
    page_mapcount include/linux/mm.h:837 [inline]
    smaps_account+0x470/0xb10 fs/proc/task_mmu.c:466
    smaps_pte_entry fs/proc/task_mmu.c:538 [inline]
    smaps_pte_range+0x611/0x1250 fs/proc/task_mmu.c:601
    walk_pmd_range mm/pagewalk.c:128 [inline]
    walk_pud_range mm/pagewalk.c:205 [inline]
    walk_p4d_range mm/pagewalk.c:240 [inline]
    walk_pgd_range mm/pagewalk.c:277 [inline]
    __walk_page_range+0xe23/0x1ea0 mm/pagewalk.c:379
    walk_page_vma+0x277/0x350 mm/pagewalk.c:530
    smap_gather_stats.part.0+0x148/0x260 fs/proc/task_mmu.c:768
    smap_gather_stats fs/proc/task_mmu.c:741 [inline]
    show_smap+0xc6/0x440 fs/proc/task_mmu.c:822
    seq_read_iter+0xbb0/0x1240 fs/seq_file.c:272
    seq_read+0x3e0/0x5b0 fs/seq_file.c:162
    vfs_read+0x1b5/0x600 fs/read_write.c:479
    ksys_read+0x12d/0x250 fs/read_write.c:619
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x44/0xae

The reproducer was trying to read /proc/$PID/smaps when calling
MADV_FREE at the mean time.  MADV_FREE may split THPs if it is called
for partial THP.  It may trigger the below race:

           CPU A                         CPU B
           -----                         -----
  smaps walk:                      MADV_FREE:
  page_mapcount()
    PageCompound()
                                   split_huge_page()
    page = compound_head(page)
    PageDoubleMap(page)

When calling PageDoubleMap() this page is not a tail page of THP anymore
so the BUG is triggered.

This could be fixed by elevated refcount of the page before calling
mapcount, but that would prevent it from counting migration entries, and
it seems overkilling because the race just could happen when PMD is
split so all PTE entries of tail pages are actually migration entries,
and smaps_account() does treat migration entries as mapcount == 1 as
Kirill pointed out.

Add a new parameter for smaps_account() to tell this entry is migration
entry then skip calling page_mapcount().  Don't skip getting mapcount
for device private entries since they do track references with mapcount.

Pagemap also has the similar issue although it was not reported.  Fixed
it as well.

[shy828301@gmail.com: v4]
Link: https://lkml.kernel.org/r/20220203182641.824731-1-shy828301@gmail.com
[nathan@kernel.org: avoid unused variable warning in pagemap_pmd_range()]
Link: https://lkml.kernel.org/r/20220207171049.1102239-1-nathan@kernel.org
Link: https://lkml.kernel.org/r/20220120202805.3369-1-shy828301@gmail.com
Fixes: e9b61f19858a ("thp: reintroduce split_huge_page()")
Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: syzbot+1f52b3a18d5633fa7f82@syzkaller.appspotmail.com
Acked-by: David Hildenbrand <david@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agofs/binfmt_elf: fix PT_LOAD p_align values for loaders
Mike Rapoport [Sat, 12 Feb 2022 00:32:22 +0000 (16:32 -0800)]
fs/binfmt_elf: fix PT_LOAD p_align values for loaders

Rui Salvaterra reported that Aisleroit solitaire crashes with "Wrong
__data_start/_end pair" assertion from libgc after update to v5.17-rc1.

Bisection pointed to commit 9630f0d60fec ("fs/binfmt_elf: use PT_LOAD
p_align values for static PIE") that fixed handling of static PIEs, but
made the condition that guards load_bias calculation to exclude loader
binaries.

Restoring the check for presence of interpreter fixes the problem.

Link: https://lkml.kernel.org/r/20220202121433.3697146-1-rppt@kernel.org
Fixes: 9630f0d60fec ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reported-by: Rui Salvaterra <rsalvaterra@gmail.com>
Tested-by: Rui Salvaterra <rsalvaterra@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H.J. Lu" <hjl.tools@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agoMerge tag 'soc-fixes-5.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Linus Torvalds [Fri, 11 Feb 2022 21:40:03 +0000 (13:40 -0800)]
Merge tag 'soc-fixes-5.17-1' of git://git./linux/kernel/git/soc/soc

Pull ARM SoC fixes from Arnd Bergmann:
 "This is a fairly large set of bugfixes, most of which had been sent a
  while ago but only now made it into the soc tree:

  Maintainer file updates:

   - Claudiu Beznea now co-maintains the at91 soc family, replacing
     Ludovic Desroches.

   - Michael Walle maintains the sl28cpld drivers

   - Alain Volmat and Raphael Gallais-Pou take over some drivers for ST
     platforms

   - Alim Akhtar is an additional reviewer for Samsung platforms

  Code fixes:

   - Op-tee had a problem with object lifetime that needs a slightly
     complex fix, as well as another bug with error handling.

   - Several minor issues for the OMAP platform, including a regression
     with the timer

   - A Kconfig change to fix a build-time issue on Intel SoCFPGA

  Device tree fixes:

   - The Amlogic Meson platform fixes a boot regression on am1-odroid, a
     spurious interrupt, and a problem with reserved memory regions

   - In the i.MX platform, several bug fixes are needed to make devices
     work correctly: SD card detection, alarmtimer, and sound card on
     some board. One patch for the GPU got in there by accident and gets
     reverted again.

   - TI K3 needs a fix for J721S2 serial port numbers

   - ux500 needs a fix to mount the SD card as root on the Skomer phone"

* tag 'soc-fixes-5.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (46 commits)
  Revert "arm64: dts: imx8mn-venice-gw7902: disable gpu"
  arm64: Remove ARCH_VULCAN
  MAINTAINERS: add myself as a maintainer for the sl28cpld
  MAINTAINERS: add IRC to ARM sub-architectures and Devicetree
  MAINTAINERS: arm: samsung: add Git tree and IRC
  ARM: dts: Fix boot regression on Skomer
  ARM: dts: spear320: Drop unused and undocumented 'irq-over-gpio' property
  soc: aspeed: lpc-ctrl: Block error printing on probe defer cases
  docs/ABI: testing: aspeed-uart-routing: Escape asterisk
  MAINTAINERS: update drm/stm drm/sti and cec/sti maintainers
  MAINTAINERS: Update Benjamin Gaignard maintainer status
  ARM: socfpga: fix missing RESET_CONTROLLER
  arm64: dts: meson-sm1-odroid: fix boot loop after reboot
  arm64: dts: meson-g12: drop BL32 region from SEI510/SEI610
  arm64: dts: meson-g12: add ATF BL32 reserved-memory region
  arm64: dts: meson-gx: add ATF BL32 reserved-memory region
  arm64: dts: meson-sm1-bananapi-m5: fix wrong GPIO domain for GPIOE_2
  arm64: dts: meson-sm1-odroid: use correct enable-gpio pin for tf-io regulator
  arm64: dts: meson-g12b-odroid-n2: fix typo 'dio2133'
  optee: use driver internal tee_context for some rpc
  ...

2 years agoMerge tag 'pci-v5.17-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaa...
Linus Torvalds [Fri, 11 Feb 2022 20:55:17 +0000 (12:55 -0800)]
Merge tag 'pci-v5.17-fixes-4' of git://git./linux/kernel/git/helgaas/pci

Pull pci fix from Bjorn Helgaas:
 "Revert a commit that reduced the number of IRQs used but resulted in
  interrupt storms (Bjorn Helgaas)"

* tag 'pci-v5.17-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
  Revert "PCI/portdrv: Do not setup up IRQs if there are no users"

2 years agoRevert "PCI/portdrv: Do not setup up IRQs if there are no users"
Bjorn Helgaas [Mon, 7 Feb 2022 22:33:30 +0000 (16:33 -0600)]
Revert "PCI/portdrv: Do not setup up IRQs if there are no users"

This reverts commit 0e8ae5a6ff5952253cd7cc0260df838ab4c21009.

0e8ae5a6ff59 ("PCI/portdrv: Do not setup up IRQs if there are no users")
reduced usage of IRQs when we don't think we need them.  But Joey, Sergiu,
and David reported choppy GUI rendering, systems that became unresponsive
every few seconds, incorrect values reported by cpufreq, and high IRQ 16
CPU usage.

Joey bisected the issues to 0e8ae5a6ff59, so revert it until we figure out
a better solution.

Link: https://lore.kernel.org/r/20220210222717.GA658201@bhelgaas
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215533
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215546
Reported-by: Joey Corleone <joey.corleone@mail.ru>
Reported-by: Sergiu Deitsch <sergiu.deitsch@gmail.com>
Reported-by: David Spencer <dspencer577@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org # v5.16+
Cc: Jan Kiszka <jan.kiszka@siemens.com>
2 years agoMerge tag 'riscv-for-linus-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 11 Feb 2022 20:02:09 +0000 (12:02 -0800)]
Merge tag 'riscv-for-linus-5.17-rc4' of git://git./linux/kernel/git/riscv/linux

Pull RISC-V fixes from Palmer Dabbelt:

 - A fix to avoid undefined behavior when stack backtracing, which
   manifests in GCC as incorrect stack addresses

 - A few fixes for the XIP kernels

 - A fix to tracking NUMA state on CPU hotplug

 - Support for the recently relesaed binutils-2.38, which changed the
   default ISA version to one without CSRs or fence.i in 'I' extension

* tag 'riscv-for-linus-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: fix build with binutils 2.38
  riscv: cpu-hotplug: clear cpu from numa map when teardown
  riscv: extable: fix err reg writing in dedicated uaccess handler
  riscv/mm: Add XIP_FIXUP for riscv_pfn_base
  riscv/mm: Add XIP_FIXUP for phys_ram_base
  riscv: Fix XIP_FIXUP_FLASH_OFFSET
  riscv: eliminate unreliable __builtin_frame_address(1)

2 years agoMerge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Linus Torvalds [Fri, 11 Feb 2022 19:55:26 +0000 (11:55 -0800)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux

Pull arm64 fixes from Catalin Marinas:

 - Enable Cortex-A510 erratum 2051678 by default as we do with other
   errata.

 - arm64 IORT: Check the node revision for PMCG resources to cope with
   old firmware based on a broken revision of the spec that had no way
   to describe the second register page (when an implementation is using
   the recommended RELOC_CTRS feature).

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  ACPI/IORT: Check node revision for PMCG resources
  arm64: Enable Cortex-A510 erratum 2051678 by default

2 years agoMerge tag 'acpi-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael...
Linus Torvalds [Fri, 11 Feb 2022 19:48:13 +0000 (11:48 -0800)]
Merge tag 'acpi-5.17-rc4' of git://git./linux/kernel/git/rafael/linux-pm

Pull ACPI fixes from Rafael Wysocki:
 "These revert two commits that turned out to be problematic and fix two
  issues related to wakeup from suspend-to-idle on x86.

  Specifics:

   - Revert a recent change that attempted to avoid issues with
     conflicting address ranges during PCI initialization, because it
     turned out to introduce a regression (Hans de Goede).

   - Revert a change that limited EC GPE wakeups from suspend-to-idle to
     systems based on Intel hardware, because it turned out that systems
     based on hardware from other vendors depended on that functionality
     too (Mario Limonciello).

   - Fix two issues related to the handling of wakeup interrupts and
     wakeup events signaled through the EC GPE during suspend-to-idle on
     x86 (Rafael Wysocki)"

* tag 'acpi-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  x86/PCI: revert "Ignore E820 reservations for bridge windows on newer systems"
  PM: s2idle: ACPI: Fix wakeup interrupts handling
  ACPI: PM: s2idle: Cancel wakeup before dispatching EC GPE
  ACPI: PM: Revert "Only mark EC GPE for wakeup on Intel systems"

2 years agoMerge tag 'gfs2-v5.16-rc3-fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 11 Feb 2022 19:36:32 +0000 (11:36 -0800)]
Merge tag 'gfs2-v5.16-rc3-fixes2' of git://git./linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fixes from Andreas Gruenbacher:

 - Revert debug commit that causes unexpected data corruption

 - Fix muti-block reservation regression

* tag 'gfs2-v5.16-rc3-fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fix gfs2_release for non-writers regression
  Revert "gfs2: check context in gfs2_glock_put"

2 years agoMerge tag 'block-5.17-2022-02-11' of git://git.kernel.dk/linux-block
Linus Torvalds [Fri, 11 Feb 2022 19:26:07 +0000 (11:26 -0800)]
Merge tag 'block-5.17-2022-02-11' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request
      - nvme-tcp: fix bogus request completion when failing to send AER
        (Sagi Grimberg)
      - add the missing nvme_complete_req tracepoint for batched
        completion (Bean Huo)

 - Revert of the loop async autoclear issue that has continued to plague
   us this release. A few patchsets exists to improve this, but they are
   too invasive to be considered at this point (Tetsuo)

* tag 'block-5.17-2022-02-11' of git://git.kernel.dk/linux-block:
  loop: revert "make autoclear operation asynchronous"
  nvme-tcp: fix bogus request completion when failing to send AER
  nvme: add nvme_complete_req tracepoint for batched completion

2 years agoMerge tag 'io_uring-5.17-2022-02-11' of git://git.kernel.dk/linux-block
Linus Torvalds [Fri, 11 Feb 2022 19:18:42 +0000 (11:18 -0800)]
Merge tag 'io_uring-5.17-2022-02-11' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

 - Fix a false-positive warning from an older gcc (Alviro)

 - Allow oom killer invocations from io_uring_setup (Shakeel)

* tag 'io_uring-5.17-2022-02-11' of git://git.kernel.dk/linux-block:
  mm: io_uring: allow oom-killer from io_uring_setup
  io_uring: Clean up a false-positive warning from GCC 9.3.0

2 years agoMerge tag 'gpio-fixes-for-v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 11 Feb 2022 19:05:49 +0000 (11:05 -0800)]
Merge tag 'gpio-fixes-for-v5.17-rc4' of git://git./linux/kernel/git/brgl/linux

Pull gpio fixes from Bartosz Golaszewski:

 - use sleeping variants of GPIO accessors where needed
   in gpio-aggregator

 - never return kernel's internal error codes to user-space
   in gpiolib core

 - use the correct register for reading output values in
   gpio-sifive

 - fix line hogging in gpio-sim

* tag 'gpio-fixes-for-v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
  gpio: sim: fix hogs with custom chip labels
  gpio: sifive: use the correct register to read output values
  gpiolib: Never return internal error codes to user space
  gpio: aggregator: Fix calling into sleeping GPIO controllers

2 years agoMerge tag 'ata-5.17-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal...
Linus Torvalds [Fri, 11 Feb 2022 18:42:31 +0000 (10:42 -0800)]
Merge tag 'ata-5.17-rc4-2' of git://git./linux/kernel/git/dlemoal/libata

Pull ata fixes from Damien Le Moal:
 "A couple of additional fixes for 5.17-rc4:

   - Fix compilation warnings in the sata_fsl driver (powerpc) (me)

   - Disable TRIM commands on M88V29 devices as these commands are
     failing despite the device reporting it supports TRIM (Zoltan)"

* tag 'ata-5.17-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
  ata: libata-core: Disable TRIM on M88V29
  ata: sata_fsl: fix sscanf() and sysfs_emit() format strings

2 years agoMerge tag 'drm-fixes-2022-02-11' of git://anongit.freedesktop.org/drm/drm
Linus Torvalds [Fri, 11 Feb 2022 18:35:12 +0000 (10:35 -0800)]
Merge tag 'drm-fixes-2022-02-11' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Dave Airlie:
 "Regular fixes pull, mostly i915 and amd fixes, along with a
  maintainers update for fbdev core.

  Otherwise just some build fixes and vc4 HDMI fixes.

  fbdev:
   - MAINTAINERS: add Daniel as fbdev core module maintainer
   - build warning fix
   - implicit type cast fix

  panel:
   - simple: Fix assignments from panel_dpi_probe()

  privacy-screen:
   - fix docs warning

  i915:
   - non-x86 build fix
   - ttm error propogation fix
   - drrs on hsw/ivb disabled
   - BIOS readout fixes
   - missing stackdepot oops fix

  amd:
   - DCN 3.1 display fixes
   - GC 10.3.1 harvest fix
   - Page flip irq fix
   - hwmon label fix
   - DCN 2.0 display fix

  rockchip:
   - fix HDMI error cleanup
   - fix RK3399 VOP register fields

  vc4:
   - HDMI fixes
   - remove redundant code"

* tag 'drm-fixes-2022-02-11' of git://anongit.freedesktop.org/drm/drm: (25 commits)
  drm/amdgpu/display: change pipe policy for DCN 2.0
  drm/amd/pm: fix hwmon node of power1_label create issue
  drm/amd/display: keep eDP Vdd on when eDP stream is already enabled
  drm/amd/display: fix yellow carp wm clamping
  drm/amd/display: Cap pflip irqs per max otg number
  drm/amdgpu: add utcl2_harvest to gc 10.3.1
  display/amd: decrease message verbosity about watermarks table failure
  drm/rockchip: vop: Correct RK3399 VOP register fields
  drm/rockchip: dw_hdmi: Do not leave clock enabled in error case
  MAINTAINERS: Add entry for fbdev core
  fbcon: Avoid 'cap' set but not used warning
  drm/privacy-screen: Fix sphinx warning
  drm/i915: Workaround broken BIOS DBUF configuration on TGL/RKL
  drm/i915: Populate pipe dbuf slices more accurately during readout
  drm/i915: Allow !join_mbus cases for adlp+ dbuf configuration
  drm/i915: Fix header test for !CONFIG_X86
  drm/i915/ttm: Return some errors instead of trying memcpy move
  drm/i915: Disable DRRS on IVB/HSW port != A
  drm/i915: Fix oops due to missing stack depot
  drm/vc4: crtc: Fix redundant variable assignment
  ...

2 years agoMerge tag 'trace-v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt...
Linus Torvalds [Fri, 11 Feb 2022 18:22:48 +0000 (10:22 -0800)]
Merge tag 'trace-v5.17-rc2' of git://git./linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fixes to the RTLA tooling

 - A fix to a tp_printk overriding tp_printk_stop_on_boot on the
   command line

* tag 'trace-v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix tp_printk option related with tp_printk_stop_on_boot
  MAINTAINERS: Add RTLA entry
  rtla: Fix segmentation fault when failing to enable -t
  rtla/trace: Error message fixup
  rtla/utils: Fix session duration parsing
  rtla: Follow kernel version

2 years agoKVM: SVM: fix race between interrupt delivery and AVIC inhibition
Maxim Levitsky [Tue, 8 Feb 2022 11:48:42 +0000 (06:48 -0500)]
KVM: SVM: fix race between interrupt delivery and AVIC inhibition

If svm_deliver_avic_intr is called just after the target vcpu's AVIC got
inhibited, it might read a stale value of vcpu->arch.apicv_active
which can lead to the target vCPU not noticing the interrupt.

To fix this use load-acquire/store-release so that, if the target vCPU
is IN_GUEST_MODE, we're guaranteed to see a previous disabling of the
AVIC.  If AVIC has been disabled in the meanwhile, proceed with the
KVM_REQ_EVENT-based delivery.

Incomplete IPI vmexit has the same races as svm_deliver_avic_intr, and
in fact it can be handled in exactly the same way; the only difference
lies in who has set IRR, whether svm_deliver_interrupt or the processor.
Therefore, svm_complete_interrupt_delivery can be used to fix incomplete
IPI vmexits as well.

Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>