Christoph Hellwig [Fri, 1 Jun 2018 16:03:07 +0000 (09:03 -0700)]
iomap: move IOMAP_F_BOUNDARY to gfs2
Just define a range of fs specific flags and use that in gfs2 instead of
exposing this internal flag globally.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Christoph Hellwig [Fri, 1 Jun 2018 16:03:07 +0000 (09:03 -0700)]
iomap: fix the comment describing IOMAP_NOWAIT
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Christoph Hellwig [Fri, 1 Jun 2018 16:03:06 +0000 (09:03 -0700)]
iomap: inline data should be an iomap type, not a flag
Inline data is fundamentally different from our normal mapped case in that
it doesn't even have a block address. So instead of having a flag for it
it should be an entirely separate iomap range type.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Christoph Hellwig [Fri, 1 Jun 2018 16:03:06 +0000 (09:03 -0700)]
mm: split ->readpages calls to avoid non-contiguous pages lists
That way file systems don't have to go spotting for non-contiguous pages
and work around them. It also kicks off I/O earlier, allowing it to
finish earlier and reduce latency.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Christoph Hellwig [Fri, 1 Jun 2018 16:03:05 +0000 (09:03 -0700)]
mm: return an unsigned int from __do_page_cache_readahead
We never return an error, so switch to returning an unsigned int. Most
callers already did implicit casts to an unsigned type, and the one that
didn't can be simplified now.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Christoph Hellwig [Fri, 1 Jun 2018 16:03:05 +0000 (09:03 -0700)]
mm: give the 'ret' variable a better name __do_page_cache_readahead
It counts the number of pages acted on, so name it nr_pages to make that
obvious.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Christoph Hellwig [Fri, 1 Jun 2018 16:03:05 +0000 (09:03 -0700)]
block: add a lower-level bio_add_page interface
For the upcoming removal of buffer heads in XFS we need to keep track of
the number of outstanding writeback requests per page. For this we need
to know if bio_add_page merged a region with the previous bvec or not.
Instead of adding additional arguments this refactors bio_add_page to
be implemented using three lower level helpers which users like XFS can
use directly if they care about the merge decisions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Dave Chinner [Thu, 31 May 2018 23:49:00 +0000 (16:49 -0700)]
xfs: fix error handling in xfs_refcount_insert()
generic/475 fired an assert failure just after the filesystem was
shut down:
XFS: Assertion failed: fs_is_ok, file: fs/xfs/libxfs/xfs_refcount.c, line: 182
.....
Call Trace:
xfs_refcount_insert+0x151/0x190
xfs_refcount_adjust_extents.constprop.11+0x9c/0x470
xfs_refcount_adjust.constprop.10+0xb0/0x270
xfs_refcount_finish_one+0x25a/0x420
xfs_trans_log_finish_refcount_update+0x2a/0x40
xfs_refcount_update_finish_item+0x35/0xa0
xfs_defer_finish+0x15e/0x4d0
xfs_reflink_remap_extent+0x1bc/0x610
xfs_reflink_remap_blocks+0x6e/0x280
xfs_reflink_remap_range+0x311/0x530
vfs_clone_file_range+0x119/0x200
....
If xfs_btree_insert() returns an error, the corruption check fires
instead of passing the error back the caller. The corruption check
should be after we've checked for an error, not before, thereby
avoiding assert failures if the filesystem shuts down during a
refcount btree record insert.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Darrick J. Wong [Thu, 31 May 2018 16:12:10 +0000 (09:12 -0700)]
xfs: fix xfs_rtalloc_rec units
All the realtime allocation functions deal with space on the rtdev in
units of realtime extents. However, struct xfs_rtalloc_rec confusingly
uses the word 'block' in the name, even though they're really extents.
Fix the naming problem and fix all the unit handling problems in the two
existing users.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Darrick J. Wong [Thu, 31 May 2018 16:07:21 +0000 (09:07 -0700)]
xfs: strengthen rtalloc query range checks
Strengthen the rtalloc range query checks to make sure that the keys do
not run off the end of the realtime device inappropriately. Note that
the query range functions require units of rt extents, not blocks,
despite the type name.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Darrick J. Wong [Thu, 31 May 2018 16:07:20 +0000 (09:07 -0700)]
xfs: xfs_rtbuf_get should check the bmapi_read results
The xfs_rtbuf_get function should check the block mapping it gets back
from bmapi_read. If there are no mappings or the mapping isn't a real
extent, we should return -EFSCORRUPTED rather than trying to read a
garbage value. We also require realtime bitmap blocks to be real,
written allocations.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Darrick J. Wong [Thu, 31 May 2018 16:07:20 +0000 (09:07 -0700)]
xfs: xfs_rtword_t should be unsigned, not signed
xfs_rtword_t is used for bit manipulations in the realtime bitmap file.
Since we're performing bit shifts with this type, we don't want sign
extension and we don't want to be left shifting negative quantities
because that's undefined behavior.
This also shuts up these UBSAN warnings:
UBSAN: Undefined behaviour in fs/xfs/libxfs/xfs_rtbitmap.c:833:48
signed integer overflow:
-
2147483648 - 1 cannot be represented in type 'int'
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Dave Jiang [Wed, 30 May 2018 20:03:46 +0000 (13:03 -0700)]
dax: change bdev_dax_supported() to support boolean returns
The function return values are confusing with the way the function is
named. We expect a true or false return value but it actually returns
0/-errno. This makes the code very confusing. Changing the return values
to return a bool where if DAX is supported then return true and no DAX
support returns false.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Darrick J. Wong [Wed, 30 May 2018 20:03:45 +0000 (13:03 -0700)]
fs: allow per-device dax status checking for filesystems
Change bdev_dax_supported so it takes a bdev parameter. This enables
multi-device filesystems like xfs to check that a dax device can work for
the particular filesystem. Once that's in place, actually fix all the
parts of XFS where we need to be able to distinguish between datadev and
rtdev.
This patch fixes the problem where we screw up the dax support checking
in xfs if the datadev and rtdev have different dax capabilities.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[rez: Re-added __bdev_dax_supported() for !CONFIG_FS_DAX cases]
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Darrick J. Wong [Wed, 30 May 2018 05:18:12 +0000 (22:18 -0700)]
xfs: repair superblocks
If one of the backup superblocks is found to differ seriously from
superblock 0, write out a fresh copy from the in-core sb.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Darrick J. Wong [Wed, 30 May 2018 05:18:11 +0000 (22:18 -0700)]
xfs: add helpers to attach quotas to inodes
Add a helper routine to attach quota information to inodes that are
about to undergo repair. If that fails, we need to schedule a
quotacheck for the next mount but allow the corrupted metadata repair to
continue.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Darrick J. Wong [Wed, 30 May 2018 05:18:10 +0000 (22:18 -0700)]
xfs: recover AG btree roots from rmap data
Add a helper function to help us recover btree roots from the rmap data.
Callers pass in a list of rmap owner codes, buffer ops, and magic
numbers. We iterate the rmap records looking for owner matches, and
then read the matching blocks to see if the magic number & uuid match.
If so, we then read-verify the block, and if that passes then we retain
a pointer to the block with the highest level, assuming that by the end
of the call we will have found the root. This will be used to reset the
AGF/AGI btree root fields during their rebuild procedures.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Darrick J. Wong [Wed, 30 May 2018 05:18:10 +0000 (22:18 -0700)]
xfs: add helpers to dispose of old btree blocks after a repair
Now that we've plumbed in the ability to construct a list of dead btree
blocks following a repair, add more helpers to dispose of them. This is
done by examining the rmapbt -- if the btree was the only owner we can
free the block, otherwise it's crosslinked and we can only remove the
rmapbt record.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Darrick J. Wong [Wed, 30 May 2018 05:18:09 +0000 (22:18 -0700)]
xfs: add helpers to collect and sift btree block pointers during repair
Add some helpers to assemble a list of fs block extents. Generally,
repair functions will iterate the rmapbt to make a list (1) of all
extents owned by the nominal owner of the metadata structure; then they
will iterate all other structures with the same rmap owner to make a
list (2) of active blocks; and finally we have a subtraction function to
subtract all the blocks in (2) from (1), with the result that (1) is now
a list of blocks that were owned by the old btree and must be disposed.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Darrick J. Wong [Wed, 30 May 2018 05:18:09 +0000 (22:18 -0700)]
xfs: add helpers to allocate and initialize fresh btree roots
Add a pair of helper functions to allocate and initialize fresh btree
roots. The repair functions will use these as part of recreating
corrupted metadata.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Darrick J. Wong [Wed, 30 May 2018 05:18:08 +0000 (22:18 -0700)]
xfs: add helpers to deal with transaction allocation and rolling
For repairs, we need to reserve at least as many blocks as we think
we're going to need to rebuild the data structure, and we're going to
need some helpers to roll transactions while maintaining locks on the AG
headers so that other threads cannot wander into the middle of a repair.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Darrick J. Wong [Wed, 30 May 2018 05:24:44 +0000 (22:24 -0700)]
xfs: grab the per-ag structure whenever relevant
Grab and hold the per-AG data across a scrub run whenever relevant.
This helps us avoid repeated trips through rcu and the radix tree
in the repair code.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Souptick Joarder [Tue, 29 May 2018 17:39:03 +0000 (10:39 -0700)]
fs: xfs: Change return type to vm_fault_t
Use new return type vm_fault_t for fault handlers.
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Darrick J. Wong [Thu, 24 May 2018 15:54:59 +0000 (08:54 -0700)]
xfs: fix inobt magic number check
In commit
a6a781a58befcbd467c ("xfs: have buffer verifier functions
report failing address") the bad magic number return was ported
incorrectly.
Fixes:
a6a781a58befcbd467ce843af4eaca3906aa1f08
Reported-by: syzbot+08ab33be0178b76851c8@syzkaller.appspotmail.com
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Darrick J. Wong [Tue, 22 May 2018 18:48:08 +0000 (11:48 -0700)]
fs: clear writeback errors in inode_init_always
In inode_init_always(), we clear the inode mapping flags, which clears
any retained error (AS_EIO, AS_ENOSPC) bits. Unfortunately, we do not
also clear wb_err, which means that old mapping errors can leak through
to new inodes.
This is crucial for the XFS inode allocation path because we recycle old
in-core inodes and we do not want error state from an old file to leak
into the new file. This bug was discovered by running generic/036 and
generic/047 in a loop and noticing that the EIOs generated by the
collision of direct and buffered writes in generic/036 would survive the
remount between 036 and 047, and get reported to the fsyncs (on
different files!) in generic/047.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Omar Sandoval [Wed, 16 May 2018 18:13:34 +0000 (11:13 -0700)]
iomap: don't allow holes in swapfiles
generic_swapfile_activate() doesn't allow holes, so we should be
consistent here. This is also a bit safer: if the user creates a
swapfile with, say, truncate -s $SIZE followed by mkswap, they should
really get an error and not much less swap space than they expected.
swapon(8) will error out before calling swapon(2) if the file has holes,
anyways.
Fixes:
9d93388b0afe ("iomap: add a swapfile activation function")
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Omar Sandoval [Wed, 16 May 2018 18:13:34 +0000 (11:13 -0700)]
iomap: provide more useful errors for invalid swap files
Currently, for an invalid swap file, we print the same error message
regardless of the reason. This isn't very useful for an admin, who will
likely want to know why exactly they can't use their swap file. So,
let's add specific error messages for each reason, and also move the
bdev check after the flags checks, since the latter are more
fundamental.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Eric Sandeen [Tue, 15 May 2018 20:21:48 +0000 (13:21 -0700)]
xfs: implement online get/set fs label
The GET ioctl is trivial, just return the current label.
The SET ioctl is more involved:
It transactionally modifies the superblock to write a new filesystem
label to the primary super.
A new variant of xfs_sync_sb then writes the superblock buffer
immediately to disk so that the change is visible from userspace.
It then invalidates any page cache that userspace might have previously
read on the block device so that i.e. blkid can see the change
immediately, and updates all secondary superblocks as userspace relable
does.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[darrick: use dchinner's new xfs_update_secondary_sbs function]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Eric Sandeen [Tue, 15 May 2018 20:20:03 +0000 (13:20 -0700)]
fs: copy BTRFS_IOC_[SG]ET_FSLABEL to vfs
This retains 256 chars as the maximum size through the interface, which
is the btrfs limit and AFAIK exceeds any other filesystem's maximum
label size.
This just copies the ioctl for now and leaves it in place for btrfs
for the time being. A later patch will allow btrfs to use the new
common ioctl definition, but it may be sent after this is merged.
(Note, Reviewed-by's were originally given for the combined vfs+btrfs
patch, some license taken here.)
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Dave Chinner [Mon, 14 May 2018 06:10:08 +0000 (23:10 -0700)]
xfs: factor the ag length extension code into libxfs
Growfs currently manually codes the extension of the last AG in a
filesytem during the growfs process. Factor that out of the growfs
code and move it into libxfs along with teh rest of the AG header
modification code.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Dave Chinner [Mon, 14 May 2018 06:10:08 +0000 (23:10 -0700)]
xfs: move growfs core to libxfs
So it can be shared with userspace (e.g. mkfs) easily.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Dave Chinner [Mon, 14 May 2018 06:10:08 +0000 (23:10 -0700)]
xfs: rework secondary superblock updates in growfs
Right now we wait until we've committed changes to the primary
superblock before we initialise any of the new secondary
superblocks. This means that if we have any write errors for new
secondary superblocks we end up with garbage in place rather than
zeros or even an "in progress" superblock to indicate a grow
operation is being done.
To ensure we can write the secondary superblocks, initialise them
earlier in the same loop that initialises the AG headers. We stamp
the new secondary superblocks here with the old geometry, but set
the "sb_inprogress" field to indicate that updates are being done to
the superblock so they cannot be used. This will result in the
secondary superblock fields being updated or triggering errors that
will abort the grow before we commit any permanent changes.
This also means we can change the update mechanism of the secondary
superblocks. We know that we are going to wholly overwrite the
information in the struct xfs_sb in the buffer, so there's no point
reading it from disk. Just allocate an uncached buffer, zero it in
memory, stamp the new superblock structure in it and write it out.
If we fail to write it out, then we'll leave the existing sb (old or
new w/ inprogress) on disk for repair to deal with later.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Dave Chinner [Mon, 14 May 2018 06:10:07 +0000 (23:10 -0700)]
xfs: separate secondary sb update in growfs
This happens after all the transactions to update the superblock
occur, and errors need to be handled slightly differently. Seperate
out the code into it's own function, and clean up the error goto
stack in the core growfs code as it is now much simpler.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Dave Chinner [Mon, 14 May 2018 06:10:07 +0000 (23:10 -0700)]
xfs: make imaxpct changes in growfs separate
When growfs changes the imaxpct value of the filesystem, it runs
through all the "change size" growfs code, whether it needs to or
not. Separate out changing imaxpct into it's own function and
transaction to simplify the rest of the growfs code.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Dave Chinner [Mon, 14 May 2018 06:10:06 +0000 (23:10 -0700)]
xfs: turn ag header initialisation into a table driven operation
There's still more cookie cutter code in setting up each AG header.
Separate all the variables into a simple structure and iterate a
table of header definitions to initialise everything.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Dave Chinner [Mon, 14 May 2018 06:10:06 +0000 (23:10 -0700)]
xfs: factor ag btree root block initialisation
Cookie cutter code, easily factored.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Dave Chinner [Mon, 14 May 2018 06:10:06 +0000 (23:10 -0700)]
xfs: convert growfs AG header init to use buffer lists
We currently write all new AG headers synchronously, which can be
slow for large grow operations. All we really need to do is ensure
all the headers are on disk before we run the growfs transaction, so
convert this to a buffer list and a delayed write operation. We
block waiting for the delayed write buffer submission to complete,
so this will fulfill the requirement to have all the buffers written
correctly before proceeding.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Dave Chinner [Mon, 14 May 2018 06:10:05 +0000 (23:10 -0700)]
xfs: factor out AG header initialisation from growfs core
The intialisation of new AG headers is mostly common with the
userspace mkfs code and growfs in the kernel, so start factoring it
out so we can move it to libxfs and use it in both places.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Dave Chinner [Mon, 14 May 2018 06:10:05 +0000 (23:10 -0700)]
xfs: one-shot cached buffers
For the new growfs work, we want to ensure that we serialise
secondary superblock updates with other operations (e.g. scrub)
correctly, but we don't want to cache the buffers for long term
reuse. We need cached buffers for serialisation, however.
To solve this, introduce a "oneshot" buffer which will be marshalled
through the cache but then released once the last current reference
goes away. If the buffer is already cached, then we ignore the
"one-shot" behaviour and leave the buffer in the state it was prior
to the one-shot command being run. This means we don't perturb
either the working set or existing cached buffer state by a one-shot
operation.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Darrick J. Wong [Mon, 14 May 2018 13:34:36 +0000 (06:34 -0700)]
xfs: implement the metadata repair ioctl flag
Plumb in the pieces necessary to make the "scrub" subfunction of
the scrub ioctl actually work. This means that we make the IFLAG_REPAIR
flag to the scrub ioctl actually do something, and we add an errortag
knob so that xfstests can force the kernel to rebuild a metadata
structure even if there's nothing wrong with it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Darrick J. Wong [Mon, 14 May 2018 13:34:35 +0000 (06:34 -0700)]
xfs: create tracepoints for online repair
These tracepoints will be used to debug the online repair routines.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Darrick J. Wong [Mon, 14 May 2018 13:34:35 +0000 (06:34 -0700)]
xfs: teach xfs_bmapi_remap to accept some bmapi flags
Teach xfs_bmapi_remap how to map in unwritten extent and to skip rmap
updates. This enables us to rebuild real and unwritten extents from the
rmapbt.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Darrick J. Wong [Mon, 14 May 2018 13:34:34 +0000 (06:34 -0700)]
xfs: make xfs_bmapi_remapi work with attribute forks
Add a new flags argument to xfs_bmapi_remapi so that we can pass BMAPI
flags into the function. This enables us to pass in BMAPI_ATTRFORK so
that we can remap things into the attribute fork. Eventually the
online repair code will use this to rebuild attribute forks, so make it
non-static.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Darrick J. Wong [Mon, 14 May 2018 13:34:34 +0000 (06:34 -0700)]
xfs: hoist xfs_scrub_agfl_walk to libxfs as xfs_agfl_walk
This function is basically a generic AGFL block iterator, so promote it
to libxfs ahead of online repair wanting to use it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Darrick J. Wong [Mon, 14 May 2018 13:34:34 +0000 (06:34 -0700)]
xfs: avoid ABBA deadlock when scrubbing parent pointers
In normal operation, the XFS convention is to take an inode's iolock
and then allocate a transaction. However, when scrubbing parent inodes
this is inverted -- we allocated the transaction to do the scrub, and
now we're trying to grab the parent's iolock. This can lead to ABBA
deadlocks: some thread grabbed the parent's iolock and is waiting for
space for a transaction while our parent scrubber is sitting on a
transaction trying to get the parent's iolock.
Therefore, convert all iolock attempts to use trylock; if that fails,
they can use the existing mechanisms to back off and try again.
The ABBA deadlock didn't happen with a non-repair scrub because the
transactions don't reserve any space, but repair scrubs require
reservation in order to update metadata. However, any other concurrent
metadata update (e.g. directory create in the parent) could also induce
this deadlock with the parent scrubber.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Darrick J. Wong [Mon, 14 May 2018 13:34:33 +0000 (06:34 -0700)]
xfs: scrub the data fork of the realtime inodes
The realtime bitmap and summary inodes live on the metadata device, so
we can scrub their data forks with the regular scrubbers.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Darrick J. Wong [Mon, 14 May 2018 13:34:33 +0000 (06:34 -0700)]
xfs: quota scrub should use bmapbtd scrubber
Replace the quota scrubber's open-coded data fork scrubber with a
redirected call to the bmapbtd scrubber. This strengthens the quota
scrub to include all the cross-referencing that it does.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Darrick J. Wong [Mon, 14 May 2018 13:34:32 +0000 (06:34 -0700)]
xfs: don't continue scrub if already corrupt
If we've already decided that something is corrupt, we might as well
abort all the loops and exit as quickly as possible.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Darrick J. Wong [Mon, 14 May 2018 13:34:32 +0000 (06:34 -0700)]
xfs: refactor quota limits initialization
Replace all the if (!error) weirdness with helper functions that follow
our regular coding practices, and factor out the ternary expression soup.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Darrick J. Wong [Mon, 14 May 2018 13:34:31 +0000 (06:34 -0700)]
xfs: superblock scrub should use short-lived buffers
Secondary superblocks are rarely used, so create a helper to read a
given non-primary AG's superblock and ensure that it won't stick around
hogging memory.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Darrick J. Wong [Mon, 14 May 2018 13:34:31 +0000 (06:34 -0700)]
xfs: skip scrub xref if corruption already noted
Don't bother looking for cross-referencing problems if the metadata is
already corrupt or we've already found a cross-referencing problem.
Since we added a helper function for flags testing, convert existing
users to use it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Dave Chinner [Fri, 11 May 2018 04:50:23 +0000 (21:50 -0700)]
xfs: clear sb->s_fs_info on mount failure
We recently had an oops reported on a 4.14 kernel in
xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
and so the m_perag_tree lookup walked into lala land.
Essentially, the machine was under memory pressure when the mount
was being run, xfs_fs_fill_super() failed after allocating the
xfs_mount and attaching it to sb->s_fs_info. It then cleaned up and
freed the xfs_mount, but the sb->s_fs_info field still pointed to
the freed memory. Hence when the superblock shrinker then ran
it fell off the bad pointer.
With the superblock shrinker problem fixed at teh VFS level, this
stale s_fs_info pointer is still a problem - we use it
unconditionally in ->put_super when the superblock is being torn
down, and hence we can still trip over it after a ->fill_super
call failure. Hence we need to clear s_fs_info if
xfs-fs_fill_super() fails, and we need to check if it's valid in
the places it can potentially be dereferenced after a ->fill_super
failure.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Dave Chinner [Fri, 11 May 2018 04:50:23 +0000 (21:50 -0700)]
xfs: add mount delay debug option
Similar to log_recovery_delay, this delay occurs between the VFS
superblock being initialised and the xfs_mount being fully
initialised. It also poisons the per-ag radix tree node so that it
can be used for triggering shrinker races during mount
such as the following:
<run memory pressure workload in background>
$ cat dirty-mount.sh
#! /bin/bash
umount -f /dev/pmem0
mkfs.xfs -f /dev/pmem0
mount /dev/pmem0 /mnt/test
rm -f /mnt/test/foo
xfs_io -fxc "pwrite 0 4k" -c fsync -c "shutdown" /mnt/test/foo
umount /dev/pmem0
# let's crash it now!
echo 30 > /sys/fs/xfs/debug/mount_delay
mount /dev/pmem0 /mnt/test
echo 0 > /sys/fs/xfs/debug/mount_delay
umount /dev/pmem0
$ sudo ./dirty-mount.sh
.....
[ 60.378118] CPU: 3 PID: 3577 Comm: fs_mark Tainted: G D W 4.16.0-rc5-dgc #440
[ 60.378120] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 60.378124] RIP: 0010:radix_tree_next_chunk+0x76/0x320
[ 60.378127] RSP: 0018:
ffffc9000276f4f8 EFLAGS:
00010282
[ 60.383670] RAX:
a5a5a5a5a5a5a5a4 RBX:
0000000000000010 RCX:
000000000000001a
[ 60.385277] RDX:
0000000000000000 RSI:
ffffc9000276f540 RDI:
0000000000000000
[ 60.386554] RBP:
0000000000000000 R08:
0000000000000000 R09:
a5a5a5a5a5a5a5a5
[ 60.388194] R10:
0000000000000006 R11:
0000000000000001 R12:
ffffc9000276f598
[ 60.389288] R13:
0000000000000040 R14:
0000000000000228 R15:
ffff880816cd6458
[ 60.390827] FS:
00007f5c124b9740(0000) GS:
ffff88083fc00000(0000) knlGS:
0000000000000000
[ 60.392253] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 60.393423] CR2:
00007f5c11bba0b8 CR3:
000000035580e001 CR4:
00000000000606e0
[ 60.394519] Call Trace:
[ 60.395252] radix_tree_gang_lookup_tag+0xc4/0x130
[ 60.395948] xfs_perag_get_tag+0x37/0xf0
[ 60.396522] xfs_reclaim_inodes_count+0x32/0x40
[ 60.397178] xfs_fs_nr_cached_objects+0x11/0x20
[ 60.397837] super_cache_count+0x35/0xc0
[ 60.399159] shrink_slab.part.66+0xb1/0x370
[ 60.400194] shrink_node+0x7e/0x1a0
[ 60.401058] try_to_free_pages+0x199/0x470
[ 60.402081] __alloc_pages_slowpath+0x3a1/0xd20
[ 60.403729] __alloc_pages_nodemask+0x1c3/0x200
[ 60.404941] cache_grow_begin+0x20b/0x2e0
[ 60.406164] fallback_alloc+0x160/0x200
[ 60.407088] kmem_cache_alloc+0x111/0x4e0
[ 60.408038] ? xfs_buf_rele+0x61/0x430
[ 60.408925] kmem_zone_alloc+0x61/0xe0
[ 60.409965] xfs_inode_alloc+0x24/0x1d0
.....
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Brian Foster [Thu, 10 May 2018 16:35:42 +0000 (09:35 -0700)]
xfs: factor out nodiscard helpers
The changes to skip discards of speculative preallocation and
unwritten extents introduced several new wrapper functions through
the bunmapi -> extent free codepath to reduce churn in all of the
associated callers. In several cases, these wrappers simply toggle a
single flag to skip or not skip discards for the resulting blocks.
The explicit _nodiscard() wrappers for such an isolated set of
callers is a bit overkill. Kill off these wrappers and replace with
the calls to the underlying functions in the contexts that need to
control discard behavior. Retain the wrappers that preserve the
original calling conventions to serve the original purpose of
reducing code churn.
This is a refactoring patch and does not change behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Darrick J. Wong [Thu, 10 May 2018 15:38:15 +0000 (08:38 -0700)]
iomap: add a swapfile activation function
Add a new iomap_swapfile_activate function so that filesystems can
activate swap files without having to use the obsolete and slow bmap
function. This enables XFS to support fallocate'd swap files and
swap files on realtime devices.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Darrick J. Wong [Wed, 9 May 2018 17:03:56 +0000 (10:03 -0700)]
xfs: halt auto-reclamation activities while rebuilding rmap
Rebuilding the reverse-mapping tree requires us to quiesce all inodes in
the filesystem, so we must stop background reclamation of post-EOF and
CoW prealloc blocks.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Darrick J. Wong [Wed, 9 May 2018 17:02:32 +0000 (10:02 -0700)]
xfs: add BMAPI_NORMAP flag to perform block remapping without updating rmapbt
Add a new flag, XFS_BMAPI_NORMAP, which will perform file block
remapping without updating the rmapbt. This will be used by the repair
code to reconstruct bmbts from the rmapbt, in which case we don't want
the rmapbt update.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Darrick J. Wong [Wed, 9 May 2018 17:02:03 +0000 (10:02 -0700)]
xfs: add repair helpers for the reference count btree
Add a couple of functions to the refcount btree and generic btree code
that will be used to repair the refcountbt.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Darrick J. Wong [Wed, 9 May 2018 17:02:02 +0000 (10:02 -0700)]
xfs: add repair helpers for the reverse mapping btree
Add a couple of functions to the reverse mapping btree that will be used
to repair the rmapbt.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Darrick J. Wong [Wed, 9 May 2018 17:02:02 +0000 (10:02 -0700)]
xfs: expose various functions to repair code
Expose various helpers that the repair code will want to use.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Darrick J. Wong [Wed, 9 May 2018 17:02:01 +0000 (10:02 -0700)]
xfs: add helpers to calculate btree size
Add a bunch of helper functions that calculate the sizes of various
btrees. These will be used to repair btrees and btree headers.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Darrick J. Wong [Wed, 9 May 2018 17:02:01 +0000 (10:02 -0700)]
xfs: refactor scrub transaction allocation function
Since the transaction allocation helper is about to become more complex,
move it to common.c and remove the redundant parameters.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Darrick J. Wong [Wed, 9 May 2018 17:02:00 +0000 (10:02 -0700)]
xfs: btree scrub should check minrecs
Strengthen the btree block header checks to detect the number of records
being less than the btree type's minimum record count. Certain blocks
are allowed to violate this constraint -- specifically any btree block
at the top of the tree can have fewer than minrecs records.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Darrick J. Wong [Wed, 9 May 2018 17:02:00 +0000 (10:02 -0700)]
xfs: clean up scrub usage of KM_NOFS
All scrub code runs in transaction context, which means that memory
allocations are automatically run in PF_MEMALLOC_NOFS context. It's
therefore unnecessary to pass in KM_NOFS to allocation routines, so
clean them all out.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Darrick J. Wong [Wed, 9 May 2018 17:02:00 +0000 (10:02 -0700)]
xfs: avoid ilock games in the quota scrubber
Refactor the quota scrubber to take the quotaofflock and grab the quota
inode in the setup function so that we can treat quota in the same
"scrub in the context of this inode" (i.e. sc->ip) manner as we treat
any other inode. We do have to drop the quota inode's ILOCK_EXCL to use
dqiterate, but since dquots have their own individual locks the ILOCK
wasn't helping us anyway.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Darrick J. Wong [Fri, 4 May 2018 22:31:21 +0000 (15:31 -0700)]
xfs: refactor dquot iteration
Create a helper function to iterate all the dquots of a given type in
the system, and refactor the dquot scrub to use it. This will get more
use in the quota repair code.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 4 May 2018 22:31:20 +0000 (15:31 -0700)]
xfs: rename on-disk dquot counter zap functions
The function 'xfs_qm_dqiterate' doesn't iterate dquots at all, it
iterates all dquot blocks of a quota inode and clears the counters.
Therefore, change the name to something more descriptive so that we can
introduce a real dquot iterator later.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 4 May 2018 22:30:24 +0000 (15:30 -0700)]
xfs: replace XFS_QMOPT_DQALLOC with a simple boolean
DQALLOC is only ever used with xfs_qm_dqget*, and the only flag that the
_dqget family of functions cares about is DQALLOC. Therefore, change
it to a boolean 'can alloc?' flag for the dqget interfaces where that
makes sense.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Darrick J. Wong [Fri, 4 May 2018 22:30:23 +0000 (15:30 -0700)]
xfs: remove direct calls to _qm_dqread
The quota initialization code needs an "uncached" variant of _dqget to
read in default quota limits and timers before the dquot cache is fully
set up. We've already split up _dqget into its component pieces so
create a fourth variant to address this need, and make dqread internal
to xfs_dquot.c again.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 4 May 2018 22:30:23 +0000 (15:30 -0700)]
xfs: refactor xfs_qm_dqtobp and xfs_qm_dqalloc
Separate the disk dquot read and allocation functionality into
two helper functions, then refactor dqread to call them directly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Darrick J. Wong [Fri, 4 May 2018 22:30:23 +0000 (15:30 -0700)]
xfs: refactor incore dquot initialization functions
Create two incore dquot initialization functions that will help us to
disentangle dqget and dqread.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 4 May 2018 22:30:22 +0000 (15:30 -0700)]
xfs: fetch dquots directly during quotacheck
Quotacheck only runs during mount, which means that there are no other
processes in the system that could be doing chown or chproj. Therefore
there's no potential for racing to attach dquots to the inode so we can
drop all the ILOCK and race detection bits from quotacheck.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 4 May 2018 22:30:22 +0000 (15:30 -0700)]
xfs: split out dqget for inodes from regular dqget
There are two uses of dqget here -- one is to return the dquot for a
given type and id, and the other is to return the dquot for a given type
and inode. Those are two separate things, so split them into two
smaller functions.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 4 May 2018 22:30:21 +0000 (15:30 -0700)]
xfs: remove unnecessary xfs_qm_dqattach parameter
The flags argument is always zero, get rid of it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 4 May 2018 22:30:21 +0000 (15:30 -0700)]
xfs: delegate dqget input checks to helper function
Move the dqget input checks to a separate function in preparation for
splitting up the dqget functionality.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 4 May 2018 22:30:20 +0000 (15:30 -0700)]
xfs: refactor dquot cache handling
Delegate the dquot cache handling (radix tree lookup and insertion) to
separate helper functions so that we can continue to simplify the body
of dqget.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 4 May 2018 22:30:20 +0000 (15:30 -0700)]
xfs: refactor XFS_QMOPT_DQNEXT out of existence
There's only one caller of DQNEXT and its semantics can be moved into a
separate function, so create the function and get rid of the flag.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Fri, 4 May 2018 22:30:20 +0000 (15:30 -0700)]
xfs: don't spray logs when dquot flush/purge fail
When dquot flush or purge fail there's no need to spam the logs, we've
already logged the IO error or fs shutdown that caused the flush
failures.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Darrick J. Wong [Fri, 4 May 2018 22:30:19 +0000 (15:30 -0700)]
xfs: release new dquot buffer on defer_finish error
In commit
efa092f3d4c6 "[XFS] Fixes a bug in the quota code when
allocating a new dquot record", we allocate a new dquot block, grab a
buffer to initialize it, and return the locked initialized dquot buffer
to the caller for further in-core dquot initialization. Unfortunately,
if the _bmap_finish errored out, _qm_dqalloc would also error out
without bothering to free the (locked) buffer. Leaking a locked buffer
caused hangs in generic/388 when quotas are enabled.
Furthermore, the _bmap_finish -> _defer_finish conversion in
310a75a3c6c747 ("xfs: change xfs_bmap_{finish,cancel,init,free} ->
xfs_defer_*") failed to observe that the buffer was held going into
_defer_finish and therefore failed to notice that the buffer lock is
/not/ maintained afterwards. Now that we can bjoin a buffer to a
defer_ops, use this mechanism to ensure that the buffer stays locked
across the _defer_finish. Release the holds and locks on the buffer as
appropriate if we have to error out.
There is a subtlety here for the caller in that the buffer emerges
locked and held to the transaction, so if the _trans_commit fails we
have to release the buffer explicitly. This fixes the unmount hang
in generic/388 when quotas are enabled.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Brian Foster [Wed, 9 May 2018 15:45:05 +0000 (08:45 -0700)]
xfs: don't discard on free of unwritten extents
Unwritten extents by definition have not been written to until they
are converted to normal written extents. If unwritten extents are
freed from a file, it is therefore guaranteed that the blocks have
not been written to since allocation (note that zero range punches
and reallocates blocks).
To cut down on online discards generated from workloads that make
use of preallocation, skip discards of extents if they are in the
unwritten state when the extent is freed.
Note that this optimization does not apply to log recovery, during
which all freed extents are discarded if online discard is enabled.
Also note that it may be possible for a filesystem crash to occur
after write completion of an unwritten extent but before unwritten
conversion such that the extent remains unwritten after log
recovery. Since this pseudo-inconsistency may already be possible
after a crash (consider writing to recently allocated blocks where
the allocation transaction is lost after a crash), this change
shouldn't introduce any fundamental limitations that don't already
exist. In short, on storage stacks where discards are important,
it's good practice to run an occasional fstrim even with online
discard enabled in the filesystem, particularly after a crash.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Brian Foster [Wed, 9 May 2018 15:45:04 +0000 (08:45 -0700)]
xfs: skip online discard during eofblocks trims
We've had reports of online discard operations being sent from XFS
on write-only workloads. These discards occur as a result of
eofblocks trims that can occur after a large file copy completes.
These discards are slightly confusing for users who might be paying
close attention to online discards (i.e., vdo) due to performance
sensitivity. They also happen to be spurious because freed post-eof
blocks by definition have not been written to during the current
allocation cycle.
Update xfs_free_eofblocks() to skip discards that are purely
attributed to eofblocks trims. This cuts down the number of spurious
discards that may occur on write-only workloads due to normal
preallocation activity.
Note that discards of post-eof extents can still occur from other
codepaths that do not isolate handling of post-eof blocks from those
within eof. For example, file unlinks and truncates may still cause
discards for any file blocks affected by the operation.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Brian Foster [Wed, 9 May 2018 15:45:04 +0000 (08:45 -0700)]
xfs: add bmapi nodiscard flag
Freed extents are unconditionally discarded when online discard is
enabled. Define XFS_BMAPI_NODISCARD to allow callers to bypass
discards when unnecessary. For example, this will be useful for
eofblocks trimming.
This patch does not change behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Dave Chinner [Wed, 9 May 2018 14:49:37 +0000 (07:49 -0700)]
xfs: get rid of the log item descriptor
It's just a connector between a transaction and a log item. There's
a 1:1 relationship between a log item descriptor and a log item,
and a 1:1 relationship between a log item descriptor and a
transaction. Both relationships are created and terminated at the
same time, so why do we even have the descriptor?
Replace it with a specific list_head in the log item and a new
log item dirtied flag to replace the XFS_LID_DIRTY flag.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[darrick: fix up deferred agfl intent finish_item use of LID_DIRTY]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Dave Chinner [Wed, 9 May 2018 14:49:10 +0000 (07:49 -0700)]
xfs: add some more debug checks to buffer log item reuse
Just to make sure the item isn't associated with another
transaction when we try to reuse it.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Dave Chinner [Wed, 9 May 2018 14:49:10 +0000 (07:49 -0700)]
xfs: fix double ijoin in xfs_reflink_clear_inode_flag()
xfs_reflink_clear_inode_flag double-joins an inode to a transaction,
which is not allowed. Fix that and document that the caller must have
already joined it.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
[darrick: edit out trace for nonexistent ASSERT]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Dave Chinner [Wed, 9 May 2018 14:49:09 +0000 (07:49 -0700)]
xfs: fix double ijoin in xfs_reflink_cancel_cow_range
xfs_reflink_cancel_cow_range joins an inode twice to the same
transaction. This is not allowed, so fix it and document that the
callers of xfs_reflink_cancel_cow_blocks() must have already joined the
inode to the permanent transaction passed in.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
[darrick: edited the commit log to remove trace for nonexistent ASSERT]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Dave Chinner [Wed, 9 May 2018 14:49:09 +0000 (07:49 -0700)]
xfs: fix double ijoin in xfs_inactive_symlink_rmt()
xfs_inactive_symlink_rmt() does something nasty - it joins an inode
into a transaction it is already joined to. This means the inode can
have multiple log item descriptors attached to the transaction for
it. This breaks teh 1:1 mapping that is supposed to exist
between the log item and log item descriptor.
This results in the log item being processed twice during
transaction commit and CIL formatting, and there are lots of other
potential issues tha arise from double processing of log items in
the transaction commit state machine.
In this case, the inode is already held by the rolling transaction
returned from xfs_defer_finish(), so there's no need to join it
again.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Dave Chinner [Wed, 9 May 2018 14:49:09 +0000 (07:49 -0700)]
xfs: don't assert fail with AIL lock held
Been hitting AIL ordering assert failures recently, but been unable
to trace them down because the system immediately hangs up onteh
spinlock that was held when this assert fires:
XFS: Assertion failed: XFS_LSN_CMP(prev_lip->li_lsn, lip->li_lsn) <= 0, file: fs/xfs/xfs_trans_ail.c, line: 52
Move the assertions outside of the spinlock so the corpse can
be dissected. Thanks to Brian Foster for supplying a clean
way of doing this.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Dave Chinner [Wed, 9 May 2018 14:48:52 +0000 (07:48 -0700)]
xfs: adder caller IP to xfs_defer* tracepoints
So it's clear in the trace where they are being called from.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Dave Chinner [Wed, 9 May 2018 14:47:57 +0000 (07:47 -0700)]
xfs: add tracing to high level transaction operations
Because currently we have no idea what the transaction context we
are operating in is, and I need to know that information to track
down bugs in multiple log item joins to transactions.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Dave Chinner [Wed, 9 May 2018 14:47:34 +0000 (07:47 -0700)]
xfs: log item flags are racy
The log item flags contain a field that is protected by the AIL
lock - the XFS_LI_IN_AIL flag. We use non-atomic RMW operations to
set and clear these flags, but most of the updates and checks are
not done with the AIL lock held and so are susceptible to update
races.
Fix this by changing the log item flags to use atomic bitops rather
than be reliant on the AIL lock for update serialisation.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Darrick J. Wong [Fri, 4 May 2018 22:31:21 +0000 (15:31 -0700)]
xfs: add missing rmap error return
xfs_rmap_lookup_le_range can return errors, so we need to check for
them and bail out.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Darrick J. Wong [Fri, 4 May 2018 22:31:21 +0000 (15:31 -0700)]
xfs: bmap debugging should never panic the system
Don't panic() the system if the bmap records are garbage, just call
ASSERT which gives us the same backtrace but enables developers to
control if the system goes down or not. This makes debugging with
generic/388 much easier because it won't reboot the machine midway
through a run just because btree_read_bufl returns EIO when the fs has
already shut down.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Brian Foster [Tue, 8 May 2018 00:38:48 +0000 (17:38 -0700)]
xfs: defer agfl frees from directory op transactions
Directory operations can perform block allocations as entries are
added/removed from directories. Defer AGFL block frees from the
remaining directory operation transactions. This covers the hard
link, remove and rename operations.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Brian Foster [Tue, 8 May 2018 00:38:48 +0000 (17:38 -0700)]
xfs: defer frees from common inode allocation paths
Inode allocation can require block allocation for physical inode
chunk allocation, inode btree record insertion, and/or directory
block allocation for entry insertion. Any of these block allocation
requests can require AGFL fixups prior to the actual allocation.
Update the common file creation transacions to defer AGFL frees from
these contexts to avoid too much log reservation consumption
per-transaction.
Since these transactions are already passed down through the btree
cursors and da_args structure, this simply requires to attach dfops
to the transaction. Note that this covers tr_create, tr_mkdir and
tr_symlink. Other transactions such as tr_create_tmpfile do not
already make use of deferred operations and so are left alone for
the time being.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Brian Foster [Tue, 8 May 2018 00:38:48 +0000 (17:38 -0700)]
xfs: defer agfl frees from inode inactivation
XFS inode chunks are already freed via deferred operations (which
now also defer AGFL block frees), but inode btree blocks are freed
directly in the associated context. This has been known to lead to
log reservation overruns in particular workloads where an inobt
block free may require several AGFL block frees (and thus several
allocation btree modifications) before the inobt block itself is
actually freed.
To avoid this problem, defer the frees of any AGFL blocks before the
inobt block free takes place. This requires passing the dfops from
xfs_inactive_ifree() down through the inobt ->[alloc|free]_block()
callouts, which essentially only requires to attach the dfops to the
transaction since it is already carried all the way through to the
inobt update and allocation.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Brian Foster [Tue, 8 May 2018 00:38:47 +0000 (17:38 -0700)]
xfs: defer agfl block frees from deferred ops processing context
Now that AGFL block frees are deferred when dfops is set in the
transaction, start deferring AGFL block frees from contexts that are
known to push the limits of existing log reservations.
The first such context is deferred operation processing itself. This
primarily targets deferred extent frees (such as file extents and
inode chunks), but in doing so covers all allocation operations that
occur in deferred operation processing context.
Update xfs_defer_finish() to set and reset ->t_agfl_dfops across the
processing sequence. This means that any AGFL block frees due to
allocation events result in the addition of new EFIs to the dfops
rather than being processed immediately. xfs_defer_finish() rolls
the transaction at least once more to process the frees of the AGFL
blocks back to the allocation btrees and returns once the AGFL is
rectified.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Brian Foster [Tue, 8 May 2018 00:38:47 +0000 (17:38 -0700)]
xfs: defer agfl block frees when dfops is available
The AGFL fixup code executes before every block allocation/free and
rectifies the AGFL based on the current, dynamic allocation
requirements of the fs. The AGFL must hold a minimum number of
blocks to satisfy a worst case split of the free space btrees caused
by the impending allocation operation. The AGFL is also updated to
maintain the implicit requirement for a minimum number of free slots
to satisfy a worst case join of the free space btrees.
Since the AGFL caches individual blocks, AGFL reduction typically
involves multiple, single block frees. We've had reports of
transaction overrun problems during certain workloads that boil down
to AGFL reduction freeing multiple blocks and consuming more space
in the log than was reserved for the transaction.
Since the objective of freeing AGFL blocks is to ensure free AGFL
free slots are available for the upcoming allocation, one way to
address this problem is to release surplus blocks from the AGFL
immediately but defer the free of those blocks (similar to how
file-mapped blocks are unmapped from the file in one transaction and
freed via a deferred operation) until the transaction is rolled.
This turns AGFL reduction into an operation with predictable log
reservation consumption.
Add the capability to defer AGFL block frees when a deferred ops
list is available to the AGFL fixup code. Add a dfops pointer to the
transaction to carry dfops through various contexts to the allocator
context. Deferring AGFL frees is conditional behavior based on
whether the transaction pointer is populated. The long term
objective is to reuse the transaction pointer to clean up all
unrelated callchains that pass dfops on the stack along with a
transaction and in doing so, consistently defer AGFL blocks from the
allocator.
A bit of customization is required to handle deferred completion
processing because AGFL blocks are accounted against a per-ag
reservation pool and AGFL blocks are not inserted into the extent
busy list when freed (they are inserted when used and released back
to the AGFL). Reuse the majority of the existing deferred extent
free infrastructure and customize it appropriately to handle AGFL
blocks.
Note that this patch only adds infrastructure. It does not change
behavior because no callers have been updated to pass ->t_agfl_dfops
into the allocation code.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Brian Foster [Tue, 8 May 2018 00:38:46 +0000 (17:38 -0700)]
xfs: create agfl block free helper function
Refactor the AGFL block free code into a new helper such that it can
be invoked from deferred context. No functional changes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Eric Sandeen [Mon, 7 May 2018 16:20:47 +0000 (09:20 -0700)]
xfs: print specific dqblk that failed verifiers
Rather than printing the top of the buffer that held a corrupted dqblk,
restructure things to print out the specific one that failed by pushing
the calls to the verifier_error function down into the verifier which
iterates over the buffer and detects the error.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>