platform/kernel/linux-starfive.git
15 months agoxfs: cross-reference rmap records with inode btrees
Darrick J. Wong [Wed, 12 Apr 2023 02:00:39 +0000 (19:00 -0700)]
xfs: cross-reference rmap records with inode btrees

Strengthen the rmap btree record checker a little more by comparing
OWN_INOBT reverse mappings against the inode btrees.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: cross-reference rmap records with free space btrees
Darrick J. Wong [Wed, 12 Apr 2023 02:00:38 +0000 (19:00 -0700)]
xfs: cross-reference rmap records with free space btrees

Strengthen the rmap btree record checker a little more by comparing
OWN_AG reverse mappings against the free space btrees, the rmap btree,
and the AGFL.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: cross-reference rmap records with ag btrees
Darrick J. Wong [Wed, 12 Apr 2023 02:00:38 +0000 (19:00 -0700)]
xfs: cross-reference rmap records with ag btrees

Strengthen the rmap btree record checker a little more by comparing
OWN_FS and OWN_LOG reverse mappings against the AG headers and internal
logs, respectively.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: introduce bitmap type for AG blocks
Darrick J. Wong [Wed, 12 Apr 2023 02:00:37 +0000 (19:00 -0700)]
xfs: introduce bitmap type for AG blocks

Create a typechecked bitmap for extents within an AG.  Online repair
uses bitmaps to store various different types of numbers, so let's make
it obvious when we're storing xfs_agblock_t (and later xfs_fsblock_t)
versus anything else.

In subsequent patches, we're going to use agblock bitmaps to enhance the
rmapbt checker to look for discrepancies between the rmapbt records and
AG metadata block usage.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: convert xbitmap to interval tree
Darrick J. Wong [Wed, 12 Apr 2023 02:00:36 +0000 (19:00 -0700)]
xfs: convert xbitmap to interval tree

Convert the xbitmap code to use interval trees instead of linked lists.
This reduces the amount of coding required to handle the disunion
operation and in the future will make it easier to set bits in arbitrary
order yet later be able to extract maximally sized extents, which we'll
need for rebuilding certain structures.  We define our own interval tree
type so that it can deal with 64-bit indices even on 32-bit machines.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: drop the _safe behavior from the xbitmap foreach macro
Darrick J. Wong [Wed, 12 Apr 2023 02:00:36 +0000 (19:00 -0700)]
xfs: drop the _safe behavior from the xbitmap foreach macro

It's not safe to edit bitmap intervals while we're iterating them with
for_each_xbitmap_extent.  None of the existing callers actually need
that ability anyway, so drop the safe variable.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: remove the for_each_xbitmap_ helpers
Darrick J. Wong [Wed, 12 Apr 2023 02:00:35 +0000 (19:00 -0700)]
xfs: remove the for_each_xbitmap_ helpers

Remove the for_each_xbitmap_ macros in favor of proper iterator
functions.  We'll soon be switching this data structure over to an
interval tree implementation, which means that we can't allow callers to
modify the bitmap during iteration without telling us.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: don't load local xattr values during scrub
Darrick J. Wong [Wed, 12 Apr 2023 02:00:35 +0000 (19:00 -0700)]
xfs: don't load local xattr values during scrub

Local extended attributes store their values within the same leaf block.
There's no header for the values themselves, nor are they separately
checksummed.  Hence we can save a bit of time in the attr scrubber by
not wasting time retrieving the values.

Regrettably, shortform attributes do not set XFS_ATTR_LOCAL so this
offers us no advantage there, but at least there are very few attrs in
that case.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: only allocate free space bitmap for xattr scrub if needed
Darrick J. Wong [Wed, 12 Apr 2023 02:00:34 +0000 (19:00 -0700)]
xfs: only allocate free space bitmap for xattr scrub if needed

The free space bitmap is only required if we're going to check the
bestfree space at the end of an xattr leaf block.  Therefore, we can
reduce the memory requirements of this scrubber if we can determine that
the xattr is in short format.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: clean up xattr scrub initialization
Darrick J. Wong [Wed, 12 Apr 2023 02:00:34 +0000 (19:00 -0700)]
xfs: clean up xattr scrub initialization

Clean up local variable initialization and error returns in xchk_xattr.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: check used space of shortform xattr structures
Darrick J. Wong [Wed, 12 Apr 2023 02:00:33 +0000 (19:00 -0700)]
xfs: check used space of shortform xattr structures

Make sure that the records used inside a shortform xattr structure do
not overlap.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: move xattr scrub buffer allocation to top level function
Darrick J. Wong [Wed, 12 Apr 2023 02:00:32 +0000 (19:00 -0700)]
xfs: move xattr scrub buffer allocation to top level function

Move the xchk_setup_xattr_buf call from xchk_xattr_block to xchk_xattr,
since we only need to set up the leaf block bitmaps once.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: remove flags argument from xchk_setup_xattr_buf
Darrick J. Wong [Wed, 12 Apr 2023 02:00:32 +0000 (19:00 -0700)]
xfs: remove flags argument from xchk_setup_xattr_buf

All callers pass XCHK_GFP_FLAGS as the flags argument to
xchk_setup_xattr_buf, so get rid of the argument.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: split valuebuf from xchk_xattr_buf.buf
Darrick J. Wong [Wed, 12 Apr 2023 02:00:31 +0000 (19:00 -0700)]
xfs: split valuebuf from xchk_xattr_buf.buf

Move the xattr value buffer from somewhere in xchk_xattr_buf.buf[] to an
explicit pointer.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: split usedmap from xchk_xattr_buf.buf
Darrick J. Wong [Wed, 12 Apr 2023 02:00:31 +0000 (19:00 -0700)]
xfs: split usedmap from xchk_xattr_buf.buf

Move the used space bitmap from somewhere in xchk_xattr_buf.buf[] to an
explicit pointer.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: split freemap from xchk_xattr_buf.buf
Darrick J. Wong [Wed, 12 Apr 2023 02:00:30 +0000 (19:00 -0700)]
xfs: split freemap from xchk_xattr_buf.buf

Move the free space bitmap from somewhere in xchk_xattr_buf.buf[] to an
explicit pointer.  This is the start of removing the complex overloaded
memory buffer that is the source of weird memory misuse bugs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: remove unnecessary dstmap in xattr scrubber
Darrick J. Wong [Wed, 12 Apr 2023 02:00:30 +0000 (19:00 -0700)]
xfs: remove unnecessary dstmap in xattr scrubber

Replace bitmap_and with bitmap_intersects in the xattr leaf block
scrubber, since we only care if there's overlap between the used space
bitmap and the free space bitmap.  This means we don't need dstmap any
more, and can thus reduce the memory requirements.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: don't shadow @leaf in xchk_xattr_block
Darrick J. Wong [Wed, 12 Apr 2023 02:00:29 +0000 (19:00 -0700)]
xfs: don't shadow @leaf in xchk_xattr_block

Don't shadow the leaf variable here, because it's misleading to have one
place in the codebase where two variables with different types have the
same name.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: xattr scrub should ensure one namespace bit per name
Darrick J. Wong [Wed, 12 Apr 2023 02:00:29 +0000 (19:00 -0700)]
xfs: xattr scrub should ensure one namespace bit per name

Check that each extended attribute exists in only one namespace.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: check for reverse mapping records that could be merged
Darrick J. Wong [Wed, 12 Apr 2023 02:00:28 +0000 (19:00 -0700)]
xfs: check for reverse mapping records that could be merged

Enhance the rmap scrubber to flag adjacent records that could be merged.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: check overlapping rmap btree records
Darrick J. Wong [Wed, 12 Apr 2023 02:00:27 +0000 (19:00 -0700)]
xfs: check overlapping rmap btree records

The rmap btree scrubber doesn't contain sufficient checking for records
that cannot overlap but do anyway.  For the other btrees, this is
enforced by the inorder checks in xchk_btree_rec, but the rmap btree is
special because it allows overlapping records to handle shared data
extents.

Therefore, enhance the rmap btree record check function to compare each
record against the previous one so that we can detect overlapping rmap
records for space allocations that do not allow sharing.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: flag refcount btree records that could be merged
Darrick J. Wong [Wed, 12 Apr 2023 02:00:27 +0000 (19:00 -0700)]
xfs: flag refcount btree records that could be merged

Complain if we encounter refcount btree records that could be merged.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: flag free space btree records that could be merged
Darrick J. Wong [Wed, 12 Apr 2023 02:00:26 +0000 (19:00 -0700)]
xfs: flag free space btree records that could be merged

Complain if we encounter free space btree records that could be merged.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: don't call xchk_bmap_check_rmaps for btree-format file forks
Darrick J. Wong [Wed, 12 Apr 2023 02:00:26 +0000 (19:00 -0700)]
xfs: don't call xchk_bmap_check_rmaps for btree-format file forks

The logic at the end of xchk_bmap_want_check_rmaps tries to detect a
file fork that has been zapped by what will become the online inode
repair code.  Zapped forks are in FMT_EXTENTS with zero extents, and
some sort of hint that there's supposed to be data somewhere in the
filesystem.

Unfortunately, the inverted logic here is confusing and has the effect
that we always call xchk_bmap_check_rmaps for FMT_BTREE forks.  This is
horribly inefficient and unnecessary, so invert the logic to get rid of
this performance problem.  This has caused 8h delays in generic/333 and
generic/334.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: split the xchk_bmap_check_rmaps into a predicate
Darrick J. Wong [Wed, 12 Apr 2023 02:00:25 +0000 (19:00 -0700)]
xfs: split the xchk_bmap_check_rmaps into a predicate

This function has two parts: the second part scans every reverse mapping
record for this file fork to make sure that there's a corresponding
mapping in the fork, and the first part decides if we even want to do
that.

Split the first part into a separate predicate so that we can make more
changes to it in the next patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: alert the user about data/attr fork mappings that could be merged
Darrick J. Wong [Wed, 12 Apr 2023 02:00:25 +0000 (19:00 -0700)]
xfs: alert the user about data/attr fork mappings that could be merged

If the data or attr forks have mappings that could be merged, let the
user know that the structure could be optimized.  This isn't a
filesystem corruption since the regular filesystem does not try to be
smart about merging bmbt records.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: split xchk_bmap_xref_rmap into two functions
Darrick J. Wong [Wed, 12 Apr 2023 02:00:24 +0000 (19:00 -0700)]
xfs: split xchk_bmap_xref_rmap into two functions

There's more special-cased functionality than not in this function.
Split it into two so that each can be far more cohesive.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: accumulate iextent records when checking bmap
Darrick J. Wong [Wed, 12 Apr 2023 02:00:24 +0000 (19:00 -0700)]
xfs: accumulate iextent records when checking bmap

Currently, the bmap scrubber checks file fork mappings individually.  In
the case that the file uses multiple mappings to a single contiguous
piece of space, the scrubber repeatedly locks the AG to check the
existence of a reverse mapping that overlaps this file mapping.  If the
reverse mapping starts before or ends after the mapping we're checking,
it will also crawl around in the bmbt checking correspondence for
adjacent extents.

This is not very time efficient because it does the crawling while
holding the AGF buffer, and checks the middle mappings multiple times.
Instead, create a custom iextent record iterator function that combines
multiple adjacent allocated mappings into one large incore bmbt record.
This is feasible because the incore bmbt record length is 64-bits wide.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: change bmap scrubber to store the previous mapping
Darrick J. Wong [Wed, 12 Apr 2023 02:00:23 +0000 (19:00 -0700)]
xfs: change bmap scrubber to store the previous mapping

Convert the inode data/attr/cow fork scrubber to remember the entire
previous mapping, not just the next expected offset.  No behavior
changes here, but this will enable some better checking in subsequent
patches.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: don't take the MMAPLOCK when scrubbing file metadata
Darrick J. Wong [Wed, 12 Apr 2023 02:00:22 +0000 (19:00 -0700)]
xfs: don't take the MMAPLOCK when scrubbing file metadata

The MMAPLOCK stabilizes mappings in a file's pagecache.  Therefore, we
do not need it to check directories, symlinks, extended attributes, or
file-based metadata.  Reduce its usage to the one case that requires it,
which is when we want to scrub the data fork of a regular file.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: retain the AGI when we can't iget an inode to scrub the core
Darrick J. Wong [Wed, 12 Apr 2023 02:00:22 +0000 (19:00 -0700)]
xfs: retain the AGI when we can't iget an inode to scrub the core

xchk_get_inode is not quite the right function to be calling from the
inode scrubber setup function.  The common get_inode function either
gets an inode and installs it in the scrub context, or it returns an
error code explaining what happened.  This is acceptable for most file
scrubbers because it is not in their scope to fix corruptions in the
inode core and fork areas that cause iget to fail.

Dealing with these problems is within the scope of the inode scrubber,
however.  If iget fails with EFSCORRUPTED, we need to xchk_inode to flag
that as corruption.  Since we can't get our hands on an incore inode, we
need to hold the AGI to prevent inode allocation activity so that
nothing changes in the inode metadata.

Looking ahead to the inode core repair patches, we will also need to
hold the AGI buffer into xrep_inode so that we can make modifications to
the xfs_dinode structure without any other thread swooping in to
allocate or free the inode.

Adapt the xchk_get_inode into xchk_setup_inode since this is a one-off
use case where the error codes we check for are a little different, and
the return state is much different from the common function.

xchk_setup_inode prepares to check or repair an inode record, so it must
continue the scrub operation even if the inode/inobt verifiers cause
xfs_iget to return EFSCORRUPTED.  This is done by attaching the locked
AGI buffer to the scrub transaction and returning 0 to move on to the
actual scrub.  (Later, the online inode repair code will also want the
xfs_imap structure so that it can reset the ondisk xfs_dinode
structure.)

xchk_get_inode retrieves an inode on behalf of a scrubber that operates
on an incore inode -- data/attr/cow forks, directories, xattrs,
symlinks, parent pointers, etc.  If the inode/inobt verifiers fail and
xfs_iget returns EFSCORRUPTED, we want to exit to userspace (because the
caller should be fix the inode first) and drop everything we acquired
along the way.

A behavior common to both functions is that it's possible that xfs_scrub
asked for a scrub-by-handle concurrent with the inode being freed or the
passed-in inumber is invalid.  In this case, we call xfs_imap to see if
the inobt index thinks the inode is allocated, and return ENOENT
("nothing to check here") to userspace if this is not the case.  The
imap lookup is why both functions call xchk_iget_agi.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: rename xchk_get_inode -> xchk_iget_for_scrubbing
Darrick J. Wong [Wed, 12 Apr 2023 02:00:21 +0000 (19:00 -0700)]
xfs: rename xchk_get_inode -> xchk_iget_for_scrubbing

Dave Chinner suggested renaming this function to make more obvious what
it does.  The function returns an incore inode to callers that want to
scrub a metadata structure that hangs off an inode.  If the iget fails
with EINVAL, it will single-step the loading process to distinguish
between actually free inodes or impossible inumbers (ENOENT);
discrepancies between the inobt freemask and the free status in the
inode record (EFSCORRUPTED).  Any other negative errno is returned
unchanged.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: fix an inode lookup race in xchk_get_inode
Darrick J. Wong [Wed, 12 Apr 2023 02:00:21 +0000 (19:00 -0700)]
xfs: fix an inode lookup race in xchk_get_inode

In commit d658e, we tried to improve the robustnes of xchk_get_inode in
the face of EINVAL returns from iget by calling xfs_imap to see if the
inobt itself thinks that the inode is allocated.  Unfortunately, that
commit didn't consider the possibility that the inode gets allocated
after iget but before imap.  In this case, the imap call will succeed,
but we turn that into a corruption error and tell userspace the inode is
corrupt.

Avoid this false corruption report by grabbing the AGI header and
retrying the iget before calling imap.  If the iget succeeds, we can
proceed with the usual scrub-by-handle code.  Fix all the incorrect
comments too, since unreadable/corrupt inodes no longer result in EINVAL
returns.

Fixes: d658e72b4a09 ("xfs: distinguish between corrupt inode and invalid inum in xfs_scrub_get_inode")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: manage inode DONTCACHE status at irele time
Darrick J. Wong [Wed, 12 Apr 2023 02:00:20 +0000 (19:00 -0700)]
xfs: manage inode DONTCACHE status at irele time

Right now, there are statements scattered all over the online fsck
codebase about how we can't use XFS_IGET_DONTCACHE because of concerns
about scrub's unusual practice of releasing inodes with transactions
held.

However, iget is the wrong place to handle this -- the DONTCACHE state
doesn't matter at all until we try to *release* the inode, and here we
get things wrong in multiple ways:

First, if we /do/ have a transaction, we must NOT drop the inode,
because the inode could have dirty pages, dropping the inode will
trigger writeback, and writeback can trigger a nested transaction.

Second, if the inode already had an active reference and the DONTCACHE
flag set, the icache hit when scrub grabs another ref will not clear
DONTCACHE.  This is sort of by design, since DONTCACHE is now used to
initiate cache drops so that sysadmins can change a file's access mode
between pagecache and DAX.

Third, if we do actually have the last active reference to the inode, we
can set DONTCACHE to avoid polluting the cache.  This is the /one/ case
where we actually want that flag.

Create an xchk_irele helper to encode all that logic and switch the
online fsck code to use it.  Since this now means that nearly all
scrubbers use the same xfs_iget flags, we can wrap them too.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: fix parent pointer scrub racing with subdirectory reparenting
Darrick J. Wong [Wed, 12 Apr 2023 02:00:20 +0000 (19:00 -0700)]
xfs: fix parent pointer scrub racing with subdirectory reparenting

Jan Kara pointed out that rename() doesn't lock a subdirectory that is
being moved from one parent to another, even though the move requires an
update to the subdirectory's dotdot entry.  This means that it's *not*
sufficient to hold a directory's IOLOCK to stabilize the dotdot entry.
We must hold the ILOCK of both the child and the alleged parent, and
there's no use in holding the parent's IOLOCK.

With that in mind, we can get rid of all the messy code that tries to
grab the parent's IOLOCK, which means we don't need to let go of the
ILOCK of the directory whose parent we are checking.  We still have to
use nonblocking mode to take the ILOCK of the alleged parent, so the
revalidation loop has to stay.

However, we can remove the retry counter, since threads aren't supposed
to hold the ILOCK for long periods of time.  Remove the inverted ilock
helper from the common code since nobody uses it.  Remove the entire
source of -EDEADLOCK-based "retry harder" scrub executions.

Link: https://lore.kernel.org/linux-xfs/20230117123735.un7wbamlbdihninm@quack3/
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: simplify xchk_parent_validate
Darrick J. Wong [Wed, 12 Apr 2023 02:00:19 +0000 (19:00 -0700)]
xfs: simplify xchk_parent_validate

This function is unnecessarily long because it contains code to
revalidate a dotdot entry after cycling locks to try to confirm a
subdirectory parent pointer.  Shorten the codebase by making the
parent's lookup call do double duty as the revalidation code.

This weakeans the efficacy of this scrub function temporarily, but the
next patch will resolve this as part of fixing an unhandled race that is
the result of the VFS rename locking model not working the way Darrick
thought it did.

Rename this stupid 'dnum' variable too.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: remove xchk_parent_count_parent_dentries
Darrick J. Wong [Wed, 12 Apr 2023 02:00:19 +0000 (19:00 -0700)]
xfs: remove xchk_parent_count_parent_dentries

This helper is now trivial, so get rid of it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: always check the existence of a dirent's child inode
Darrick J. Wong [Wed, 12 Apr 2023 02:00:18 +0000 (19:00 -0700)]
xfs: always check the existence of a dirent's child inode

When we're scrubbing directory entries, we always need to iget the child
inode to make sure that the inode pointer points to a valid inode.  The
original directory scrub code (commit a5c4) only set us up to do this
for ftype=1 filesystems, which is not sufficient; and then commit 4b80
made it worse by exempting the dot and dotdot entries.

Sorta-fixes: a5c46e5e8912 ("xfs: scrub directory metadata")
Sorta-fixes: 4b80ac64450f ("xfs: scrub should mark a directory corrupt if any entries cannot be iget'd")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: xfs_iget in the directory scrubber needs to use UNTRUSTED
Darrick J. Wong [Wed, 12 Apr 2023 02:00:17 +0000 (19:00 -0700)]
xfs: xfs_iget in the directory scrubber needs to use UNTRUSTED

In commit 4b80ac64450f, we tried to strengthen the directory scrubber by
using the iget call to detect directory entries that point to
unallocated inodes.  Unfortunately, that commit neglected to pass
XFS_IGET_UNTRUSTED to xfs_iget, so we don't check the inode btree first.
If the inode number points to something that isn't even an inode
cluster, iget will throw corruption errors and return -EFSCORRUPTED,
which means that we fail to mark the directory corrupt.

Fixes: 4b80ac64450f ("xfs: scrub should mark a directory corrupt if any entries cannot be iget'd")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: streamline the directory iteration code for scrub
Darrick J. Wong [Wed, 12 Apr 2023 02:00:17 +0000 (19:00 -0700)]
xfs: streamline the directory iteration code for scrub

Currently, online scrub reuses the xfs_readdir code to walk every entry
in a directory.  This isn't awesome for performance, since we end up
cycling the directory ILOCK needlessly and coding around the particular
quirks of the VFS dir_context interface.

Create a streamlined version of readdir that keeps the ILOCK (since the
walk function isn't going to copy stuff to userspace), skips a whole lot
of directory walk cursor checks (since we start at 0 and walk to the
end) and has a sane way to return error codes.

Note: Porting the dotdot checking code is left for a subsequent patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: use the directory name hash function for dir scrubbing
Darrick J. Wong [Wed, 12 Apr 2023 02:00:16 +0000 (19:00 -0700)]
xfs: use the directory name hash function for dir scrubbing

The directory code has a directory-specific hash computation function
that includes a modified hash function for case-insensitive lookups.
Hence we must use that function (and not the raw da_hashname) when
checking the dabtree structure.

Found by accidentally breaking xfs/188 to create an abnormally huge
case-insensitive directory and watching scrub break.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: ensure that single-owner file blocks are not owned by others
Darrick J. Wong [Wed, 12 Apr 2023 02:00:16 +0000 (19:00 -0700)]
xfs: ensure that single-owner file blocks are not owned by others

For any file fork mapping that can only have a single owner, make sure
that there are no other rmap owners for that mapping.  This patch
requires the more detailed checking provided by xfs_rmap_count_owners so
that we can know how many rmap records for a given range of space had a
matching owner, how many had a non-matching owner, and how many
conflicted with the records that have a matching owner.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: teach scrub to check for sole ownership of metadata objects
Darrick J. Wong [Wed, 12 Apr 2023 02:00:15 +0000 (19:00 -0700)]
xfs: teach scrub to check for sole ownership of metadata objects

Strengthen online scrub's checking even further by enabling us to check
that a range of blocks are owned solely by a given owner.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: convert xfs_ialloc_has_inodes_at_extent to return keyfill scan results
Darrick J. Wong [Wed, 12 Apr 2023 02:00:15 +0000 (19:00 -0700)]
xfs: convert xfs_ialloc_has_inodes_at_extent to return keyfill scan results

Convert the xfs_ialloc_has_inodes_at_extent function to return keyfill
scan results because for a given range of inode numbers, we might have
no indexed inodes at all; the entire region might be allocated ondisk
inodes; or there might be a mix of the two.

Unfortunately, sparse inodes adds to the complexity, because each inode
record can have holes, which means that we cannot use the generic btree
_scan_keyfill function because we must look for holes in individual
records to decide the result.  On the plus side, online fsck can now
detect sub-chunk discrepancies in the inobt.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: directly cross-reference the inode btrees with each other
Darrick J. Wong [Wed, 12 Apr 2023 02:00:14 +0000 (19:00 -0700)]
xfs: directly cross-reference the inode btrees with each other

Improve the cross-referencing of the two inode btrees by directly
checking the free and hole state of each inode with the other btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: clean up broken eearly-exit code in the inode btree scrubber
Darrick J. Wong [Wed, 12 Apr 2023 02:00:13 +0000 (19:00 -0700)]
xfs: clean up broken eearly-exit code in the inode btree scrubber

Corrupt inode chunks should cause us to exit early after setting the
CORRUPT flag on the scrub state.  While we're at it, collapse trivial
helpers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: remove pointless shadow variable from xfs_difree_inobt
Darrick J. Wong [Wed, 12 Apr 2023 02:00:13 +0000 (19:00 -0700)]
xfs: remove pointless shadow variable from xfs_difree_inobt

In xfs_difree_inobt, the pag passed in was previously used to look up
the AGI buffer.  There's no need to extract it again, so remove the
shadow variable and shut up -Wshadow.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: ensure that all metadata and data blocks are not cow staging extents
Darrick J. Wong [Wed, 12 Apr 2023 02:00:12 +0000 (19:00 -0700)]
xfs: ensure that all metadata and data blocks are not cow staging extents

Make sure that all filesystem metadata blocks and file data blocks are
not also marked as CoW staging extents.  The extra checking added here
was inspired by an actual VM host filesystem corruption incident due to
bugs in the CoW handling of 4.x kernels.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: check the reference counts of gaps in the refcount btree
Darrick J. Wong [Wed, 12 Apr 2023 02:00:12 +0000 (19:00 -0700)]
xfs: check the reference counts of gaps in the refcount btree

Gaps in the reference count btree are also significant -- for these
regions, there must not be any overlapping reverse mappings.  We don't
currently check this, so make the refcount scrubber more complete.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: implement masked btree key comparisons for _has_records scans
Darrick J. Wong [Wed, 12 Apr 2023 02:00:11 +0000 (19:00 -0700)]
xfs: implement masked btree key comparisons for _has_records scans

For keyspace fullness scans, we want to be able to mask off the parts of
the key that we don't care about.  For most btree types we /do/ want the
full keyspace, but for checking that a given space usage also has a full
complement of rmapbt records (even if different/multiple owners) we need
this masking so that we only track sparseness of rm_startblock, not the
whole keyspace (which is extremely sparse).

Augment the ->diff_two_keys and ->keys_contiguous helpers to take a
third union xfs_btree_key argument, and wire up xfs_rmap_has_records to
pass this through.  This third "mask" argument should contain a nonzero
value in each structure field that should be used in the key comparisons
done during the scan.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: replace xfs_btree_has_record with a general keyspace scanner
Darrick J. Wong [Wed, 12 Apr 2023 02:00:10 +0000 (19:00 -0700)]
xfs: replace xfs_btree_has_record with a general keyspace scanner

The current implementation of xfs_btree_has_record returns true if it
finds /any/ record within the given range.  Unfortunately, that's not
sufficient for scrub.  We want to be able to tell if a range of keyspace
for a btree is devoid of records, is totally mapped to records, or is
somewhere in between.  By forcing this to be a boolean, we conflated
sparseness and fullness, which caused scrub to return incorrect results.
Fix the API so that we can tell the caller which of those three is the
current state.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: refactor ->diff_two_keys callsites
Darrick J. Wong [Wed, 12 Apr 2023 02:00:10 +0000 (19:00 -0700)]
xfs: refactor ->diff_two_keys callsites

Create wrapper functions around ->diff_two_keys so that we don't have to
remember what the return values mean, and adjust some of the code
comments to reflect the longtime code behavior.  We're going to
introduce more uses of ->diff_two_keys in the next patch, so reduce the
cognitive load for readers by doing this refactoring now.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: refactor converting btree irec to btree key
Darrick J. Wong [Wed, 12 Apr 2023 02:00:09 +0000 (19:00 -0700)]
xfs: refactor converting btree irec to btree key

We keep doing these conversions to support btree queries, so refactor
this into a helper.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: always scrub record/key order of interior records
Darrick J. Wong [Wed, 12 Apr 2023 02:00:09 +0000 (19:00 -0700)]
xfs: always scrub record/key order of interior records

In commit d47fef9342d0, we removed the firstrec and firstkey fields of
struct xchk_btree because Christoph thought they were unnecessary
because we could use the record index in the btree cursor.  This is
incorrect because bc_ptrs (now bc_levels[].ptr) tracks the cursor
position within a specific btree block, not within the entire level.

The end result is that scrub no longer detects situations where the
rightmost record of a block is identical to the leftmost record of that
block's right sibling.  Fix this regression by reintroducing record
validity booleans so that order checking skips *only* the leftmost
record/key in each level.

Fixes: d47fef9342d0 ("xfs: don't track firstrec/firstkey separately in xchk_btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: check btree keys reflect the child block
Darrick J. Wong [Wed, 12 Apr 2023 02:00:08 +0000 (19:00 -0700)]
xfs: check btree keys reflect the child block

When scrub is checking a non-root btree block, it should make sure that
the keys in the parent btree block accurately capture the keyspace that
the child block stores.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: detect unwritten bit set in rmapbt node block keys
Darrick J. Wong [Wed, 12 Apr 2023 02:00:07 +0000 (19:00 -0700)]
xfs: detect unwritten bit set in rmapbt node block keys

In the last patch, we changed the rmapbt code to remove the UNWRITTEN
bit when creating an rmapbt key from an rmapbt record, and we changed
the rmapbt key comparison code to start considering the ATTR and BMBT
flags during lookup.  This brought the behavior of the rmapbt
implementation in line with its specification.

However, there may exist filesystems that have the unwritten bit still
set in the rmapbt keys.  We should detect these situations and flag the
rmapbt as one that would benefit from optimization.  Eventually, online
repair will be able to do something in response to this.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: fix rm_offset flag handling in rmap keys
Darrick J. Wong [Wed, 12 Apr 2023 02:00:07 +0000 (19:00 -0700)]
xfs: fix rm_offset flag handling in rmap keys

Keys for extent interval records in the reverse mapping btree are
supposed to be computed as follows:

(physical block, owner, fork, is_btree, offset)

This provides users the ability to look up a reverse mapping from a file
block mapping record -- start with the physical block; then if there are
multiple records for the same block, move on to the owner; then the
inode fork type; and so on to the file offset.

Unfortunately, the code that creates rmap lookup keys from rmap records
forgot to mask off the record attribute flags, leading to ondisk keys
that look like this:

(physical block, owner, fork, is_btree, unwritten state, offset)

Fortunately, this has all worked ok for the past six years because the
key comparison functions incorrectly ignore the fork/bmbt/unwritten
information that's encoded in the on-disk offset.  This means that
lookup comparisons are only done with:

(physical block, owner, offset)

Queries can (theoretically) return incorrect results because of this
omission.  On consistent filesystems this isn't an issue because xattr
and bmbt blocks cannot be shared and hence the comparisons succeed
purely on the contents of the rm_startblock field.  For the one case
where we support sharing (written data fork blocks) all flag bits are
zero, so the omission in the comparison has no ill effects.

Unfortunately, this bug prevents scrub from detecting incorrect fork and
bmbt flag bits in the rmap btree, so we really do need to fix the
compare code.  Old filesystems with the unwritten bit erroneously set in
the rmap key struct will work fine on new kernels since we still ignore
the unwritten bit.  New filesystems on older kernels will work fine
since the old kernels never paid attention to the unwritten bit.

A previous version of this patch forgot to keep the (un)written state
flag masked during the comparison and caused a major regression in
5.9.x since unwritten extent conversion can update an rmap record
without requiring key updates.

Note that blocks cannot go directly from data fork to attr fork without
being deallocated and reallocated, nor can they be added to or removed
from a bmbt without a free/alloc cycle, so this should not cause any
regressions.

Found by fuzzing keys[1].attrfork = ones on xfs/371.

Fixes: 4b8ed67794fe ("xfs: add rmap btree operations")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: hoist inode record alignment checks from scrub
Darrick J. Wong [Wed, 12 Apr 2023 02:00:06 +0000 (19:00 -0700)]
xfs: hoist inode record alignment checks from scrub

Move the inobt record alignment checks from xchk_iallocbt_rec into
xfs_inobt_check_irec so that they are applied everywhere.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: hoist rmap record flag checks from scrub
Darrick J. Wong [Wed, 12 Apr 2023 02:00:06 +0000 (19:00 -0700)]
xfs: hoist rmap record flag checks from scrub

Move the rmap record flag checks from xchk_rmapbt_rec into
xfs_rmap_check_irec so that they are applied everywhere.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: hoist rmap record flag checks from scrub
Darrick J. Wong [Wed, 12 Apr 2023 02:00:05 +0000 (19:00 -0700)]
xfs: hoist rmap record flag checks from scrub

Move the rmap record flag checks from xchk_rmapbt_rec into
xfs_rmap_check_irec so that they are applied everywhere.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: complain about bad file mapping records in the ondisk bmbt
Darrick J. Wong [Wed, 12 Apr 2023 02:00:05 +0000 (19:00 -0700)]
xfs: complain about bad file mapping records in the ondisk bmbt

Similar to what we've just done for the other btrees, create a function
to log corrupt bmbt records and call it whenever we encounter a bad
record in the ondisk btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: complain about bad records in query_range helpers
Darrick J. Wong [Wed, 12 Apr 2023 02:00:04 +0000 (19:00 -0700)]
xfs: complain about bad records in query_range helpers

For every btree type except for the bmbt, refactor the code that
complains about bad records into a helper and make the ->query_range
helpers call it so that corruptions found via that avenue are logged.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: standardize ondisk to incore conversion for bmap btrees
Darrick J. Wong [Wed, 12 Apr 2023 02:00:04 +0000 (19:00 -0700)]
xfs: standardize ondisk to incore conversion for bmap btrees

Fix all xfs_bmbt_disk_get_all callsites to call xfs_bmap_validate_extent
and bubble up corruption reports.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: standardize ondisk to incore conversion for rmap btrees
Darrick J. Wong [Wed, 12 Apr 2023 02:00:03 +0000 (19:00 -0700)]
xfs: standardize ondisk to incore conversion for rmap btrees

Create a xfs_rmap_check_irec function to detect corruption in btree
records.  Fix all xfs_rmap_btrec_to_irec callsites to call the new
helper and bubble up corruption reports.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: return a failure address from xfs_rmap_irec_offset_unpack
Darrick J. Wong [Wed, 12 Apr 2023 02:00:02 +0000 (19:00 -0700)]
xfs: return a failure address from xfs_rmap_irec_offset_unpack

Currently, xfs_rmap_irec_offset_unpack returns only 0 or -EFSCORRUPTED.
Change this function to return the code address of a failed conversion
in preparation for the next patch, which standardizes localized record
checking and reporting code.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: standardize ondisk to incore conversion for refcount btrees
Darrick J. Wong [Wed, 12 Apr 2023 02:00:02 +0000 (19:00 -0700)]
xfs: standardize ondisk to incore conversion for refcount btrees

Create a xfs_refcount_check_irec function to detect corruption in btree
records.  Fix all xfs_refcount_btrec_to_irec callsites to call the new
helper and bubble up corruption reports.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: standardize ondisk to incore conversion for inode btrees
Darrick J. Wong [Wed, 12 Apr 2023 02:00:01 +0000 (19:00 -0700)]
xfs: standardize ondisk to incore conversion for inode btrees

Create a xfs_inobt_check_irec function to detect corruption in btree
records.  Fix all xfs_inobt_btrec_to_irec callsites to call the new
helper and bubble up corruption reports.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: standardize ondisk to incore conversion for free space btrees
Darrick J. Wong [Wed, 12 Apr 2023 02:00:01 +0000 (19:00 -0700)]
xfs: standardize ondisk to incore conversion for free space btrees

Create a xfs_alloc_btrec_to_irec function to convert an ondisk record to
an incore record, and a xfs_alloc_check_irec function to detect
corruption.  Replace all the open-coded logic with calls to the new
helpers and bubble up corruption reports.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: scrub should use ECHRNG to signal that the drain is needed
Darrick J. Wong [Wed, 12 Apr 2023 02:00:00 +0000 (19:00 -0700)]
xfs: scrub should use ECHRNG to signal that the drain is needed

In the previous patch, we added jump labels to the intent drain code so
that regular filesystem operations need not pay the price of checking
for someone (scrub) waiting on intents to drain from some part of the
filesystem when that someone isn't running.

However, I observed that xfs/285 now spends a lot more time pushing the
AIL from the inode btree scrubber than it used to.  This is because the
inobt scrubber will try push the AIL to try to get logged inode cores
written to the filesystem when it sees a weird discrepancy between the
ondisk inode and the inobt records.  This AIL push is triggered when the
setup function sees TRY_HARDER is set; and the requisite EDEADLOCK
return is initiated when the discrepancy is seen.

The solution to this performance slow down is to use a different result
code (ECHRNG) for scrub code to signal that it needs to wait for
deferred intent work items to drain out of some part of the filesystem.
When this happens, set a new scrub state flag (XCHK_NEED_DRAIN) so that
setup functions will activate the jump label.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: minimize overhead of drain wakeups by using jump labels
Darrick J. Wong [Wed, 12 Apr 2023 01:59:59 +0000 (18:59 -0700)]
xfs: minimize overhead of drain wakeups by using jump labels

To reduce the runtime overhead even further when online fsck isn't
running, use a static branch key to decide if we call wake_up on the
drain.  For compilers that support jump labels, the call to wake_up is
replaced by a nop sled when nobody is waiting for intents to drain.

From my initial microbenchmarking, every transition of the static key
between the on and off states takes about 22000ns to complete; this is
paid entirely by the xfs_scrub process.  When the static key is off
(which it should be when fsck isn't running), the nop sled adds an
overhead of approximately 0.36ns to runtime code.  The post-atomic
lockless waiter check adds about 0.03ns, which is basically free.

For the few compilers that don't support jump labels, runtime code pays
the cost of calling wake_up on an empty waitqueue, which was observed to
be about 30ns.  However, most architectures that have sufficient memory
and CPU capacity to run XFS also support jump labels, so this is not
much of a worry.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: clean up scrub context if scrub setup returns -EDEADLOCK
Darrick J. Wong [Wed, 12 Apr 2023 01:59:59 +0000 (18:59 -0700)]
xfs: clean up scrub context if scrub setup returns -EDEADLOCK

It has been a longstanding convention that online scrub and repair
functions can return -EDEADLOCK to signal that they weren't able to
obtain some necessary resource.  When this happens, the scrub framework
is supposed to release all resources attached to the scrub context, set
the TRY_HARDER flag in the scrub context flags, and try again.  In this
context, individual scrub functions are supposed to take all the
resources they (incorrectly) speculated were not necessary.

We're about to make it so that the functions that lock and wait for a
filesystem AG can also return EDEADLOCK to signal that we need to try
again with the drain waiters enabled.  Therefore, refactor
xfs_scrub_metadata to support this behavior for ->setup() functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: allow queued AG intents to drain before scrubbing
Darrick J. Wong [Wed, 12 Apr 2023 01:59:58 +0000 (18:59 -0700)]
xfs: allow queued AG intents to drain before scrubbing

When a writer thread executes a chain of log intent items, the AG header
buffer locks will cycle during a transaction roll to get from one intent
item to the next in a chain.  Although scrub takes all AG header buffer
locks, this isn't sufficient to guard against scrub checking an AG while
that writer thread is in the middle of finishing a chain because there's
no higher level locking primitive guarding allocation groups.

When there's a collision, cross-referencing between data structures
(e.g. rmapbt and refcountbt) yields false corruption events; if repair
is running, this results in incorrect repairs, which is catastrophic.

Fix this by adding to the perag structure the count of active intents
and make scrub wait until it has both AG header buffer locks and the
intent counter reaches zero.

One quirk of the drain code is that deferred bmap updates also bump and
drop the intent counter.  A fundamental decision made during the design
phase of the reverse mapping feature is that updates to the rmapbt
records are always made by the same code that updates the primary
metadata.  In other words, callers of bmapi functions expect that the
bmapi functions will queue deferred rmap updates.

Some parts of the reflink code queue deferred refcount (CUI) and bmap
(BUI) updates in the same head transaction, but the deferred work
manager completely finishes the CUI before the BUI work is started.  As
a result, the CUI drops the intent count long before the deferred rmap
(RUI) update even has a chance to bump the intent count.  The only way
to keep the intent count elevated between the CUI and RUI is for the BUI
to bump the counter until the RUI has been created.

A second quirk of the intent drain code is that deferred work items must
increment the intent counter as soon as the work item is added to the
transaction.  When a BUI completes and queues an RUI, the RUI must
increment the counter before the BUI decrements it.  The only way to
accomplish this is to require that the counter be bumped as soon as the
deferred work item is created in memory.

In the next patches we'll improve on this facility, but this patch
provides the basic functionality.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: add a tracepoint to report incorrect extent refcounts
Darrick J. Wong [Wed, 12 Apr 2023 01:59:58 +0000 (18:59 -0700)]
xfs: add a tracepoint to report incorrect extent refcounts

Add a new tracepoint so that I can see exactly what and where we failed
the refcount check.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: update copyright years for scrub/ files
Darrick J. Wong [Wed, 12 Apr 2023 01:59:57 +0000 (18:59 -0700)]
xfs: update copyright years for scrub/ files

Update the copyright years in the scrub/ source code files.  This isn't
required, but it's helpful to remind myself just how long it's taken to
develop this feature.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: fix author and spdx headers on scrub/ files
Darrick J. Wong [Wed, 12 Apr 2023 01:59:56 +0000 (18:59 -0700)]
xfs: fix author and spdx headers on scrub/ files

Fix the spdx tags to match current practice, and update the author
contact information.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: create traced helper to get extra perag references
Darrick J. Wong [Wed, 12 Apr 2023 01:59:55 +0000 (18:59 -0700)]
xfs: create traced helper to get extra perag references

There are a few places in the XFS codebase where a caller has either an
active or a passive reference to a perag structure and wants to give
a passive reference to some other piece of code.  Btree cursor creation
and inode walks are good examples of this.  Replace the open-coded logic
with a helper to do this.

The new function adds a few safeguards -- it checks that there's at
least one reference to the perag structure passed in, and it records the
refcount bump in the ftrace information.  This makes it much easier to
debug perag refcounting problems.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: give xfs_refcount_intent its own perag reference
Darrick J. Wong [Wed, 12 Apr 2023 01:59:55 +0000 (18:59 -0700)]
xfs: give xfs_refcount_intent its own perag reference

Give the xfs_refcount_intent a passive reference to the perag structure
data.  This reference will be used to enable scrub intent draining
functionality in subsequent patches.  Any space being modified by a
refcount intent is already allocated, so we need to be able to operate
even if the AG is being shrunk or offlined.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: give xfs_rmap_intent its own perag reference
Darrick J. Wong [Wed, 12 Apr 2023 01:59:54 +0000 (18:59 -0700)]
xfs: give xfs_rmap_intent its own perag reference

Give the xfs_rmap_intent a passive reference to the perag structure
data.  This reference will be used to enable scrub intent draining
functionality in subsequent patches.  The space we're (reverse) mapping
is already allocated, so we need to be able to operate even if the AG is
being shrunk or offlined.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: give xfs_extfree_intent its own perag reference
Darrick J. Wong [Wed, 12 Apr 2023 01:59:54 +0000 (18:59 -0700)]
xfs: give xfs_extfree_intent its own perag reference

Give the xfs_extfree_intent an passive reference to the perag structure
data.  This reference will be used to enable scrub intent draining
functionality in subsequent patches.  The space being freed must already
be allocated, so we need to able to run even if the AG is being offlined
or shrunk.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: pass per-ag references to xfs_free_extent
Darrick J. Wong [Wed, 12 Apr 2023 01:59:53 +0000 (18:59 -0700)]
xfs: pass per-ag references to xfs_free_extent

Pass a reference to the per-AG structure to xfs_free_extent.  Most
callers already have one, so we can eliminate unnecessary lookups.  The
one exception to this is the EFI code, which the next patch will fix.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: give xfs_bmap_intent its own perag reference
Darrick J. Wong [Wed, 12 Apr 2023 01:59:53 +0000 (18:59 -0700)]
xfs: give xfs_bmap_intent its own perag reference

Give the xfs_bmap_intent an active reference to the perag structure
data.  This reference will be used to enable scrub intent draining
functionality in subsequent patches.  Later, shrink will use these
passive references to know if an AG is quiesced or not.

The reason why we take a passive ref for a file mapping operation is
simple: we're committing to some sort of action involving space in an
AG, so we want to indicate our interest in that AG.  The space is
already allocated, so we need to be able to operate on AGs that are
offline or being shrunk.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: document future directions of online fsck
Darrick J. Wong [Wed, 12 Apr 2023 01:59:52 +0000 (18:59 -0700)]
xfs: document future directions of online fsck

Add the seventh and final chapter of the online fsck documentation,
where we talk about future functionality that can tie in with the
functionality provided by the online fsck patchset.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: document the userspace fsck driver program
Darrick J. Wong [Wed, 12 Apr 2023 01:59:51 +0000 (18:59 -0700)]
xfs: document the userspace fsck driver program

Add the sixth chapter of the online fsck design documentation, where
we discuss the details of the data structures and algorithms used by the
driver program xfs_scrub.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: document directory tree repairs
Darrick J. Wong [Wed, 12 Apr 2023 01:59:51 +0000 (18:59 -0700)]
xfs: document directory tree repairs

Directory tree repairs are the least complete part of online fsck, due
to the lack of directory parent pointers.  However, even without that
feature, we can still make some corrections to the directory tree -- we
can salvage as many directory entries as we can from a damaged
directory, and we can reattach orphaned inodes to the lost+found, just
as xfs_repair does now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: document metadata file repair
Darrick J. Wong [Wed, 12 Apr 2023 01:59:50 +0000 (18:59 -0700)]
xfs: document metadata file repair

File-based metadata (such as xattrs and directories) can be extremely
large.  To reduce the memory requirements and maximize code reuse, it is
very convenient to create a temporary file, use the regular dir/attr
code to store salvaged information, and then atomically swap the extents
between the file being repaired and the temporary file.  Record the high
level concepts behind how temporary files and atomic content swapping
should work, and then present some case studies of what the actual
repair functions do.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: document full filesystem scans for online fsck
Darrick J. Wong [Wed, 12 Apr 2023 01:59:50 +0000 (18:59 -0700)]
xfs: document full filesystem scans for online fsck

Certain parts of the online fsck code need to scan every file in the
entire filesystem.  It is not acceptable to block the entire filesystem
while this happens, which means that we need to be clever in allowing
scans to coordinate with ongoing filesystem updates.  We also need to
hook the filesystem so that regular updates propagate to the staging
records.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: document online file metadata repair code
Darrick J. Wong [Wed, 12 Apr 2023 01:59:49 +0000 (18:59 -0700)]
xfs: document online file metadata repair code

Add to the fifth chapter of the online fsck design documentation, where
we discuss the details of the data structures and algorithms used by the
kernel to repair file metadata.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: document btree bulk loading
Darrick J. Wong [Wed, 12 Apr 2023 01:59:49 +0000 (18:59 -0700)]
xfs: document btree bulk loading

Add a discussion of the btree bulk loading code, which makes it easy to
take an in-memory recordset and write it out to disk in an efficient
manner.  This also enables atomic switchover from the old to the new
structure with minimal potential for leaking the old blocks.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: document pageable kernel memory
Darrick J. Wong [Wed, 12 Apr 2023 01:59:48 +0000 (18:59 -0700)]
xfs: document pageable kernel memory

Add a discussion of pageable kernel memory, since online fsck needs
quite a bit more memory than most other parts of the filesystem to stage
records and other information.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: document how online fsck deals with eventual consistency
Darrick J. Wong [Wed, 12 Apr 2023 01:59:48 +0000 (18:59 -0700)]
xfs: document how online fsck deals with eventual consistency

Writes to an XFS filesystem employ an eventual consistency update model
to break up complex multistep metadata updates into small chained
transactions.  This is generally good for performance and scalability
because XFS doesn't need to prepare for enormous transactions, but it
also means that online fsck must be careful not to attempt a fsck action
unless it can be shown that there are no other threads processing a
transaction chain.  This part of the design documentation covers the
thinking behind the consistency model and how scrub deals with it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: document the filesystem metadata checking strategy
Darrick J. Wong [Wed, 12 Apr 2023 01:59:47 +0000 (18:59 -0700)]
xfs: document the filesystem metadata checking strategy

Begin the fifth chapter of the online fsck design documentation, where
we discuss the details of the data structures and algorithms used by the
kernel to examine filesystem metadata and cross-reference it around the
filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: document the user interface for online fsck
Darrick J. Wong [Wed, 12 Apr 2023 01:59:47 +0000 (18:59 -0700)]
xfs: document the user interface for online fsck

Start the fourth chapter of the online fsck design documentation, which
discusses the user interface and the background scrubbing service.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: document the testing plan for online fsck
Darrick J. Wong [Wed, 12 Apr 2023 01:59:46 +0000 (18:59 -0700)]
xfs: document the testing plan for online fsck

Start the third chapter of the online fsck design documentation.  This
covers the testing plan to make sure that both online and offline fsck
can detect arbitrary problems and correct them without making things
worse.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: document the general theory underlying online fsck design
Darrick J. Wong [Wed, 12 Apr 2023 01:59:45 +0000 (18:59 -0700)]
xfs: document the general theory underlying online fsck design

Start the second chapter of the online fsck design documentation.
This covers the general theory underlying how online fsck works.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoxfs: document the motivation for online fsck design
Darrick J. Wong [Wed, 12 Apr 2023 01:59:45 +0000 (18:59 -0700)]
xfs: document the motivation for online fsck design

Start the first chapter of the online fsck design documentation.
This covers the motivations for creating this in the first place.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
15 months agoLinux 6.3-rc6
Linus Torvalds [Sun, 9 Apr 2023 18:15:57 +0000 (11:15 -0700)]
Linux 6.3-rc6

15 months agoMerge tag 'perf_urgent_for_v6.3_rc6' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 9 Apr 2023 17:10:46 +0000 (10:10 -0700)]
Merge tag 'perf_urgent_for_v6.3_rc6' of git://git./linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:

 - Fix "same task" check when redirecting event output

 - Do not wait unconditionally for RCU on the event migration path if
   there are no events to migrate

* tag 'perf_urgent_for_v6.3_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix the same task check in perf_event_set_output
  perf: Optimize perf_pmu_migrate_context()

15 months agoMerge tag 'x86_urgent_for_v6.3_rc6' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 9 Apr 2023 17:00:16 +0000 (10:00 -0700)]
Merge tag 'x86_urgent_for_v6.3_rc6' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Add a new Intel Arrow Lake CPU model number

 - Fix a confusion about how to check the version of the ACPI spec which
   supports a "online capable" bit in the MADT table which lead to a
   bunch of boot breakages with Zen1 systems and VMs

* tag 'x86_urgent_for_v6.3_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu: Add model number for Intel Arrow Lake processor
  x86/acpi/boot: Correct acpi_is_processor_usable() check
  x86/ACPI/boot: Use FADT version to check support for online capable

15 months agoMerge tag 'cxl-fixes-6.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Linus Torvalds [Sun, 9 Apr 2023 16:45:46 +0000 (09:45 -0700)]
Merge tag 'cxl-fixes-6.3-rc6' of git://git./linux/kernel/git/cxl/cxl

Pull compute express link (cxl) fixes from Dan Williams:
 "Several fixes for driver startup regressions that landed during the
  merge window as well as some older bugs.

  The regressions were due to a lack of testing with what the CXL
  specification calls Restricted CXL Host (RCH) topologies compared to
  the testing with Virtual Host (VH) CXL topologies. A VH topology is
  typical PCIe while RCH topologies map CXL endpoints as Root Complex
  Integrated endpoints. The impact is some driver crashes on startup.

  This merge window also added compatibility for range registers (the
  mechanism that CXL 1.1 defined for mapping memory) to treat them like
  HDM decoders (the mechanism that CXL 2.0 defined for mapping
  Host-managed Device Memory). That work collided with the new region
  enumeration code that was tested with CXL 2.0 setups, and fails with
  crashes at startup.

  Lastly, the DOE (Data Object Exchange) implementation for retrieving
  an ACPI-like data table from CXL devices is being reworked for v6.4.
  Several fixes fell out of that work that are suitable for v6.3.

  All of this has been in linux-next for a while, and all reported
  issues [1] have been addressed.

  Summary:

   - Fix several issues with region enumeration in RCH topologies that
     can trigger crashes on driver startup or shutdown.

   - Fix CXL DVSEC range register compatibility versus region
     enumeration that leads to startup crashes

   - Fix CDAT endiannes handling

   - Fix multiple buffer handling boundary conditions

   - Fix Data Object Exchange (DOE) workqueue usage vs
     CONFIG_DEBUG_OBJECTS warn splats"

Link: http://lore.kernel.org/r/20230405075704.33de8121@canb.auug.org.au
* tag 'cxl-fixes-6.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
  cxl/hdm: Extend DVSEC range register emulation for region enumeration
  cxl/hdm: Limit emulation to the number of range registers
  cxl/region: Move coherence tracking into cxl_region_attach()
  cxl/region: Fix region setup/teardown for RCDs
  cxl/port: Fix find_cxl_root() for RCDs and simplify it
  cxl/hdm: Skip emulation when driver manages mem_enable
  cxl/hdm: Fix double allocation of @cxlhdm
  PCI/DOE: Fix memory leak with CONFIG_DEBUG_OBJECTS=y
  PCI/DOE: Silence WARN splat with CONFIG_DEBUG_OBJECTS=y
  cxl/pci: Handle excessive CDAT length
  cxl/pci: Handle truncated CDAT entries
  cxl/pci: Handle truncated CDAT header
  cxl/pci: Fix CDAT retrieval on big endian

15 months agoMerge tag '6.3-rc5-smb3-cifs-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Sun, 9 Apr 2023 01:37:45 +0000 (18:37 -0700)]
Merge tag '6.3-rc5-smb3-cifs-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs client fixes from Steve French:
 "Two cifs/smb3 client fixes, one for stable:

   - double lock fix for a cifs/smb1 reconnect path

   - DFS prefixpath fix for reconnect when server moved"

* tag '6.3-rc5-smb3-cifs-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: double lock in cifs_reconnect_tcon()
  cifs: sanitize paths in cifs_update_super_prepath.