platform/kernel/linux-rpi.git
2 years agoxfs: create shadow transaction reservations for computing minimum log size
Darrick J. Wong [Tue, 26 Apr 2022 01:38:13 +0000 (18:38 -0700)]
xfs: create shadow transaction reservations for computing minimum log size

Every time someone changes the transaction reservation sizes, they
introduce potential compatibility problems if the changes affect the
minimum log size that we validate at mount time.  If the minimum log
size gets larger (which should be avoided because doing so presents a
serious risk of log livelock), filesystems created with old mkfs will
not mount on a newer kernel; if the minimum size shrinks, filesystems
created with newer mkfs will not mount on older kernels.

Therefore, enable the creation of a shadow log reservation structure
where we can "undo" the effects of tweaks when computing minimum log
sizes.  These shadow reservations should never be used in practice, but
they insulate us from perturbations in minimum log size.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2 years agoxfs: remove a __xfs_bunmapi call from reflink
Darrick J. Wong [Tue, 26 Apr 2022 01:38:12 +0000 (18:38 -0700)]
xfs: remove a __xfs_bunmapi call from reflink

This raw call isn't necessary since we can always remove a full delalloc
extent.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2 years agoxfs: stop artificially limiting the length of bunmap calls
Darrick J. Wong [Tue, 26 Apr 2022 01:37:15 +0000 (18:37 -0700)]
xfs: stop artificially limiting the length of bunmap calls

In commit e1a4e37cc7b6, we clamped the length of bunmapi calls on the
data forks of shared files to avoid two failure scenarios: one where the
extent being unmapped is so sparsely shared that we exceed the
transaction reservation with the sheer number of refcount btree updates
and EFI intent items; and the other where we attach so many deferred
updates to the transaction that we pin the log tail and later the log
head meets the tail, causing the log to livelock.

We avoid triggering the first problem by tracking the number of ops in
the refcount btree cursor and forcing a requeue of the refcount intent
item any time we think that we might be close to overflowing.  This has
been baked into XFS since before the original e1a4 patch.

A recent patchset fixed the second problem by changing the deferred ops
code to finish all the work items created by each round of trying to
complete a refcount intent item, which eliminates the long chains of
deferred items (27dad); and causing long-running transactions to relog
their intent log items when space in the log gets low (74f4d).

Because this clamp affects /any/ unmapping request regardless of the
sharing factors of the component blocks, it degrades the performance of
all large unmapping requests -- whereas with an unshared file we can
unmap millions of blocks in one go, shared files are limited to
unmapping a few thousand blocks at a time, which causes the upper level
code to spin in a bunmapi loop even if it wasn't needed.

This also eliminates one more place where log recovery behavior can
differ from online behavior, because bunmapi operations no longer need
to requeue.  The fstest generic/447 was created to test the old fix, and
it still passes with this applied.

Partial-revert-of: e1a4e37cc7b6 ("xfs: try to avoid blowing out the transaction reservation when bunmaping a shared extent")
Depends: 27dada070d59 ("xfs: change the order in which child and parent defer ops ar finished")
Depends: 74f4d6a1e065 ("xfs: only relog deferred intent items if free space in the log gets low")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2 years agoxfs: count EFIs when deciding to ask for a continuation of a refcount update
Darrick J. Wong [Tue, 26 Apr 2022 22:29:54 +0000 (15:29 -0700)]
xfs: count EFIs when deciding to ask for a continuation of a refcount update

A long time ago, I added to XFS the ability to use deferred reference
count operations as part of a transaction chain.  This enabled us to
avoid blowing out the transaction reservation when the blocks in a
physical extent all had different reference counts because we could ask
the deferred operation manager for a continuation, which would get us a
clean transaction.

The refcount code asks for a continuation when the number of refcount
record updates reaches the point where we think that the transaction has
logged enough full btree blocks due to refcount (and free space) btree
shape changes and refcount record updates that we're in danger of
overflowing the transaction.

We did not previously count the EFIs logged to the refcount update
transaction because the clamps on the length of a bunmap operation were
sufficient to avoid overflowing the transaction reservation even in the
worst case situation where every other block of the unmapped extent is
shared.

Unfortunately, the restrictions on bunmap length avoid failure in the
worst case by imposing a maximum unmap length of ~3000 blocks, even for
non-pathological cases.  This seriously limits performance when freeing
large extents.

Therefore, track EFIs with the same counter as refcount record updates,
and use that information as input into when we should ask for a
continuation.  This enables the next patch to drop the clumsy bunmap
limitation.

Depends: 27dada070d59 ("xfs: change the order in which child and parent defer ops ar finished")
Depends: 74f4d6a1e065 ("xfs: only relog deferred intent items if free space in the log gets low")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2 years agoxfs: speed up write operations by using non-overlapped lookups when possible
Darrick J. Wong [Tue, 26 Apr 2022 01:37:06 +0000 (18:37 -0700)]
xfs: speed up write operations by using non-overlapped lookups when possible

Reverse mapping on a reflink-capable filesystem has some pretty high
overhead when performing file operations.  This is because the rmap
records for logically and physically adjacent extents might not be
adjacent in the rmap index due to data block sharing.  As a result, we
use expensive overlapped-interval btree search, which walks every record
that overlaps with the supplied key in the hopes of finding the record.

However, profiling data shows that when the index contains a record that
is an exact match for a query key, the non-overlapped btree search
function can find the record much faster than the overlapped version.
Try the non-overlapped lookup first when we're trying to find the left
neighbor rmap record for a given file mapping, which makes unwritten
extent conversion and remap operations run faster if data block sharing
is minimal in this part of the filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2 years agoxfs: speed up rmap lookups by using non-overlapped lookups when possible
Darrick J. Wong [Tue, 26 Apr 2022 01:37:06 +0000 (18:37 -0700)]
xfs: speed up rmap lookups by using non-overlapped lookups when possible

Reverse mapping on a reflink-capable filesystem has some pretty high
overhead when performing file operations.  This is because the rmap
records for logically and physically adjacent extents might not be
adjacent in the rmap index due to data block sharing.  As a result, we
use expensive overlapped-interval btree search, which walks every record
that overlaps with the supplied key in the hopes of finding the record.

However, profiling data shows that when the index contains a record that
is an exact match for a query key, the non-overlapped btree search
function can find the record much faster than the overlapped version.
Try the non-overlapped lookup first, which will make scrub run much
faster.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2 years agoxfs: simplify xfs_rmap_lookup_le call sites
Darrick J. Wong [Tue, 26 Apr 2022 01:37:06 +0000 (18:37 -0700)]
xfs: simplify xfs_rmap_lookup_le call sites

Most callers of xfs_rmap_lookup_le will retrieve the btree record
immediately if the lookup succeeds.  The overlapped version of this
function (xfs_rmap_lookup_le_range) will return the record if the lookup
succeeds, so make the regular version do it too.  Get rid of the useless
len argument, since it's not part of the lookup key.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2 years agoxfs: capture buffer ops in the xfs_buf tracepoints
Darrick J. Wong [Tue, 26 Apr 2022 01:37:05 +0000 (18:37 -0700)]
xfs: capture buffer ops in the xfs_buf tracepoints

Record the buffer ops in the xfs_buf tracepoints so that we can monitor
the alleged type of the buffer.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2 years agoMerge tag 'large-extent-counters-v9' of https://github.com/chandanr/linux into xfs...
Dave Chinner [Thu, 21 Apr 2022 06:46:17 +0000 (16:46 +1000)]
Merge tag 'large-extent-counters-v9' of https://github.com/chandanr/linux into xfs-5.19-for-next

xfs: Large extent counters

The commit xfs: fix inode fork extent count overflow
(3f8a4f1d876d3e3e49e50b0396eaffcc4ba71b08) mentions that 10 billion
data fork extents should be possible to create. However the
corresponding on-disk field has a signed 32-bit type. Hence this
patchset extends the per-inode data fork extent counter to 64 bits
(out of which 48 bits are used to store the extent count).

Also, XFS has an attribute fork extent counter which is 16 bits
wide. A workload that,
1. Creates 1 million 255-byte sized xattrs,
2. Deletes 50% of these xattrs in an alternating manner,
3. Tries to insert 400,000 new 255-byte sized xattrs
   causes the xattr extent counter to overflow.

Dave tells me that there are instances where a single file has more
than 100 million hardlinks. With parent pointers being stored in
xattrs, we will overflow the signed 16-bits wide attribute extent
counter when large number of hardlinks are created. Hence this
patchset extends the on-disk field to 32-bits.

The following changes are made to accomplish this,
1. A 64-bit inode field is carved out of existing di_pad and
   di_flushiter fields to hold the 64-bit data fork extent counter.
2. The existing 32-bit inode data fork extent counter will be used to
   hold the attribute fork extent counter.
3. A new incompat superblock flag to prevent older kernels from mounting
   the filesystem.

Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoMerge branch 'guilt/xlog-write-rework' into xfs-5.19-for-next
Dave Chinner [Thu, 21 Apr 2022 06:45:52 +0000 (16:45 +1000)]
Merge branch 'guilt/xlog-write-rework' into xfs-5.19-for-next

2 years agoMerge branch 'guilt/xfs-unsigned-flags-5.18' into xfs-5.19-for-next
Dave Chinner [Thu, 21 Apr 2022 06:45:03 +0000 (16:45 +1000)]
Merge branch 'guilt/xfs-unsigned-flags-5.18' into xfs-5.19-for-next

2 years agoMerge branch 'guilt/5.19-miscellaneous' into xfs-5.19-for-next
Dave Chinner [Thu, 21 Apr 2022 01:40:17 +0000 (11:40 +1000)]
Merge branch 'guilt/5.19-miscellaneous' into xfs-5.19-for-next

2 years agoxfs: convert log ticket and iclog flags to unsigned.
Dave Chinner [Thu, 21 Apr 2022 00:48:01 +0000 (10:48 +1000)]
xfs: convert log ticket and iclog flags to unsigned.

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: convert shutdown reasons to unsigned.
Dave Chinner [Thu, 21 Apr 2022 00:47:38 +0000 (10:47 +1000)]
xfs: convert shutdown reasons to unsigned.

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: convert quota options flags to unsigned.
Dave Chinner [Thu, 21 Apr 2022 00:47:32 +0000 (10:47 +1000)]
xfs: convert quota options flags to unsigned.

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: convert ptag flags to unsigned.
Dave Chinner [Thu, 21 Apr 2022 00:47:25 +0000 (10:47 +1000)]
xfs: convert ptag flags to unsigned.

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: convert inode lock flags to unsigned.
Dave Chinner [Thu, 21 Apr 2022 00:47:16 +0000 (10:47 +1000)]
xfs: convert inode lock flags to unsigned.

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: convert log item tracepoint flags to unsigned.
Dave Chinner [Thu, 21 Apr 2022 00:47:07 +0000 (10:47 +1000)]
xfs: convert log item tracepoint flags to unsigned.

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: convert dquot flags to unsigned.
Dave Chinner [Thu, 21 Apr 2022 00:46:55 +0000 (10:46 +1000)]
xfs: convert dquot flags to unsigned.

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: convert da btree operations flags to unsigned.
Dave Chinner [Thu, 21 Apr 2022 00:46:47 +0000 (10:46 +1000)]
xfs: convert da btree operations flags to unsigned.

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: convert buffer log item flags to unsigned.
Dave Chinner [Thu, 21 Apr 2022 00:46:40 +0000 (10:46 +1000)]
xfs: convert buffer log item flags to unsigned.

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: convert btree buffer log flags to unsigned.
Dave Chinner [Thu, 21 Apr 2022 00:46:33 +0000 (10:46 +1000)]
xfs: convert btree buffer log flags to unsigned.

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

We also pass the fields to log to xfs_btree_offsets() as a uint32_t
all cases now. I have no idea why we made that parameter a int64_t
in the first place, but while we are fixing this up change it to
a uint32_t field, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: convert AGI log flags to unsigned.
Dave Chinner [Thu, 21 Apr 2022 00:46:24 +0000 (10:46 +1000)]
xfs: convert AGI log flags to unsigned.

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: convert AGF log flags to unsigned.
Dave Chinner [Thu, 21 Apr 2022 00:46:16 +0000 (10:46 +1000)]
xfs: convert AGF log flags to unsigned.

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: convert bmapi flags to unsigned.
Dave Chinner [Thu, 21 Apr 2022 00:46:09 +0000 (10:46 +1000)]
xfs: convert bmapi flags to unsigned.

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: convert bmap extent type flags to unsigned.
Dave Chinner [Thu, 21 Apr 2022 00:46:01 +0000 (10:46 +1000)]
xfs: convert bmap extent type flags to unsigned.

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: convert scrub type flags to unsigned.
Dave Chinner [Thu, 21 Apr 2022 00:45:52 +0000 (10:45 +1000)]
xfs: convert scrub type flags to unsigned.

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

This touches xfs_fs.h so affects the user API, but the user API
fields are also unsigned so the flags should really be unsigned,
too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: convert attr type flags to unsigned.
Dave Chinner [Thu, 21 Apr 2022 00:45:41 +0000 (10:45 +1000)]
xfs: convert attr type flags to unsigned.

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: CIL context doesn't need to count iovecs
Dave Chinner [Thu, 21 Apr 2022 00:36:56 +0000 (10:36 +1000)]
xfs: CIL context doesn't need to count iovecs

Now that we account for log opheaders in the log item formatting
code, we don't actually use the aggregated count of log iovecs in
the CIL for anything. Remove it and the tracking code that
calculates it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: xlog_write() doesn't need optype anymore
Dave Chinner [Thu, 21 Apr 2022 00:36:48 +0000 (10:36 +1000)]
xfs: xlog_write() doesn't need optype anymore

So remove it from the interface and callers.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: xlog_write() no longer needs contwr state
Dave Chinner [Thu, 21 Apr 2022 00:36:37 +0000 (10:36 +1000)]
xfs: xlog_write() no longer needs contwr state

The rework of xlog_write() no longer requires xlog_get_iclog_state()
to tell it about internal iclog space reservation state to direct it
on what to do. Remove this parameter.

$ size fs/xfs/xfs_log.o.*
   text    data     bss     dec     hex filename
  26520     560       8   27088    69d0 fs/xfs/xfs_log.o.orig
  26384     560       8   26952    6948 fs/xfs/xfs_log.o.patched

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: remove xlog_verify_dest_ptr
Christoph Hellwig [Thu, 21 Apr 2022 00:36:27 +0000 (10:36 +1000)]
xfs: remove xlog_verify_dest_ptr

Just check that the offset in xlog_write_vec is smaller than the iclog
size and remove the expensive cycling through all iclogs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: introduce xlog_write_partial()
Dave Chinner [Thu, 21 Apr 2022 00:36:15 +0000 (10:36 +1000)]
xfs: introduce xlog_write_partial()

Re-implement writing of a log vector that does not fit into the
current iclog. The iclog will already be in XLOG_STATE_WANT_SYNC
because xlog_get_iclog_space() will have reserved all the remaining
iclog space for us, hence we can simply iterate over the iovecs in
the log vector getting more iclog space until the entire log vector
is written.

Handling this partial write case separately means we do need to pass
unnecessary state around for the common, fast path case when the log
vector fits entirely within the current iclog. It isolates the
complexity and allows us to modify and improve the partial write
case without impacting the simple fast path.

This change includes several improvements incorporated from patches
written by Christoph Hellwig.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: introduce xlog_write_full()
Dave Chinner [Thu, 21 Apr 2022 00:36:05 +0000 (10:36 +1000)]
xfs: introduce xlog_write_full()

Introduce an optimised version of xlog_write() that is used when the
entire write will fit in a single iclog. This greatly simplifies the
implementation of writing a log vector chain into an iclog, and sets
the ground work for a much more understandable xlog_write()
implementation.

This incorporates some factoring and simplifications proposed by
Christoph Hellwig.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: change the type of ic_datap
Christoph Hellwig [Thu, 21 Apr 2022 00:35:53 +0000 (10:35 +1000)]
xfs: change the type of ic_datap

Turn ic_datap from a char into a void pointer given that it points
to arbitrary data.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[dgc: also remove (char *) cast in xlog_alloc_log()]
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: pass lv chain length into xlog_write()
Dave Chinner [Thu, 21 Apr 2022 00:35:19 +0000 (10:35 +1000)]
xfs: pass lv chain length into xlog_write()

The caller of xlog_write() usually has a close accounting of the
aggregated vector length contained in the log vector chain passed to
xlog_write(). There is no need to iterate the chain to calculate he
length of the data in xlog_write_calculate_len() if the caller is
already iterating that chain to build it.

Passing in the vector length avoids doing an extra chain iteration,
which can be a significant amount of work given that large CIL
commits can have hundreds of thousands of vectors attached to the
chain.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: log ticket region debug is largely useless
Dave Chinner [Thu, 21 Apr 2022 00:35:09 +0000 (10:35 +1000)]
xfs: log ticket region debug is largely useless

xlog_tic_add_region() is used to trace the regions being added to a
log ticket to provide information in the situation where a ticket
reservation overrun occurs. The information gathered is stored int
the ticket, and dumped if xlog_print_tic_res() is called.

For a front end struct xfs_trans overrun, the ticket only contains
reservation tracking information - the ticket is never handed to the
log so has no regions attached to it. The overrun debug information in this
case comes from xlog_print_trans(), which walks the items attached
to the transaction and dumps their attached formatted log vectors
directly. It also dumps the ticket state, but that only contains
reservation accounting and nothing else. Hence xlog_print_tic_res()
never dumps region or overrun information from this path.

xlog_tic_add_region() is actually called from xlog_write(), which
means it is being used to track the regions seen in a
CIL checkpoint log vector chain. In looking at CIL behaviour
recently, I've seen 32MB checkpoints regularly exceed 250,000
regions in the LV chain. The log ticket debug code can track *15*
regions. IOWs, if there is a ticket overrun in the CIL code, the
ticket region tracking code is going to be completely useless for
determining what went wrong. The only thing it can tell us is how
much of an overrun occurred, and we really don't need extra debug
information in the log ticket to tell us that.

Indeed, the main place we call xlog_tic_add_region() is also adding
up the number of regions and the space used so that xlog_write()
knows how much will be written to the log. This is exactly the same
information that log ticket is storing once we take away the useless
region tracking array. Hence xlog_tic_add_region() is not useful,
but can be called 250,000 times a CIL push...

Just strip all that debug "information" out of the of the log ticket
and only have it report reservation space information when an
overrun occurs. This also reduces the size of a log ticket down by
about 150 bytes...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: reserve space and initialise xlog_op_header in item formatting
Dave Chinner [Thu, 21 Apr 2022 00:34:59 +0000 (10:34 +1000)]
xfs: reserve space and initialise xlog_op_header in item formatting

Current xlog_write() adds op headers to the log manually for every
log item region that is in the vector passed to it. While
xlog_write() needs to stamp the transaction ID into the ophdr, we
already know it's length, flags, clientid, etc at CIL commit time.

This means the only time that xlog write really needs to format and
reserve space for a new ophdr is when a region is split across two
iclogs. Adding the opheader and accounting for it as part of the
normal formatted item region means we simplify the accounting
of space used by a transaction and we don't have to special case
reserving of space in for the ophdrs in xlog_write(). It also means
we can largely initialise the ophdr in transaction commit instead
of xlog_write, making the xlog_write formatting inner loop much
tighter.

xlog_prepare_iovec() is now too large to stay as an inline function,
so we move it out of line and into xfs_log.c.

Object sizes:
text    data     bss     dec     hex filename
1125934  305951     484 1432369  15db31 fs/xfs/built-in.a.before
1123360  305951     484 1429795  15d123 fs/xfs/built-in.a.after

So the code is a roughly 2.5kB smaller with xlog_prepare_iovec() now
out of line, even though it grew in size itself.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: move log iovec alignment to preparation function
Dave Chinner [Thu, 21 Apr 2022 00:34:49 +0000 (10:34 +1000)]
xfs: move log iovec alignment to preparation function

To include log op headers directly into the log iovec regions that
the ophdrs wrap, we need to move the buffer alignment code from
xlog_finish_iovec() to xlog_prepare_iovec(). This is because the
xlog_op_header is only 12 bytes long, and we need the buffer that
the caller formats their data into to be 8 byte aligned.

Hence once we start prepending the ophdr in xlog_prepare_iovec(), we
are going to need to manage the padding directly to ensure that the
buffer pointer returned is correctly aligned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: log tickets don't need log client id
Dave Chinner [Thu, 21 Apr 2022 00:34:33 +0000 (10:34 +1000)]
xfs: log tickets don't need log client id

We currently set the log ticket client ID when we reserve a
transaction. This client ID is only ever written to the log by
a CIL checkpoint or unmount records, and so anything using a high
level transaction allocated through xfs_trans_alloc() does not need
a log ticket client ID to be set.

For the CIL checkpoint, the client ID written to the journal is
always XFS_TRANSACTION, and for the unmount record it is always
XFS_LOG, and nothing else writes to the log. All of these operations
tell xlog_write() exactly what they need to write to the log (the
optype) and build their own opheaders for start, commit and unmount
records. Hence we no longer need to set the client id in either the
log ticket or the xfs_trans.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: embed the xlog_op_header in the commit record
Dave Chinner [Thu, 21 Apr 2022 00:34:15 +0000 (10:34 +1000)]
xfs: embed the xlog_op_header in the commit record

Remove the final case where xlog_write() has to prepend an opheader
to a log transaction. Similar to the start record, the commit record
is just an empty opheader with a XLOG_COMMIT_TRANS type, so we can
just make this the payload for the region being passed to
xlog_write() and remove the special handling in xlog_write() for
the commit record.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: embed the xlog_op_header in the unmount record
Dave Chinner [Thu, 21 Apr 2022 00:34:04 +0000 (10:34 +1000)]
xfs: embed the xlog_op_header in the unmount record

Remove another case where xlog_write() has to prepend an opheader to
a log transaction. The unmount record + ophdr is smaller than the
minimum amount of space guaranteed to be free in an iclog (2 *
sizeof(ophdr)) and so we don't have to care about an unmount record
being split across 2 iclogs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: only CIL pushes require a start record
Dave Chinner [Thu, 21 Apr 2022 00:33:48 +0000 (10:33 +1000)]
xfs: only CIL pushes require a start record

So move the one-off start record writing in xlog_write() out into
the static header that the CIL push builds to write into the log
initially. This simplifes the xlog_write() logic a lot.

pahole on x86-64 confirms that the xlog_cil_trans_hdr is correctly
32 bit aligned and packed for copying the log op and transaction
headers directly into the log as a single log region copy.

struct xlog_cil_trans_hdr {
        struct xlog_op_header      oph[2];               /*     0    24 */
        struct xfs_trans_header    thdr;                 /*    24    16 */
        struct xfs_log_iovec       lhdr[2];              /*    40    32 */

        /* size: 72, cachelines: 2, members: 3 */
        /* last cacheline: 8 bytes */
};

A wart is needed to handle the fact that length of the region the
opheader points to doesn't include the opheader length. hence if
we embed the opheader, we have to substract the opheader length from
the length written into the opheader by the generic copying code.
This will eventually go away when everything is converted to
embedded opheaders.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: factor out the CIL transaction header building
Dave Chinner [Thu, 21 Apr 2022 00:33:23 +0000 (10:33 +1000)]
xfs: factor out the CIL transaction header building

It is static code deep in the middle of the CIL push logic. Factor
it out into a helper so that it is clear and easy to modify
separately.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: simplify local variable assignment in file write code
Kaixu Xia [Wed, 20 Apr 2022 22:47:54 +0000 (08:47 +1000)]
xfs: simplify local variable assignment in file write code

Get the struct inode pointer from iocb->ki_filp->f_mapping->host
directly and the other variables are unnecessary, so simplify the
local variables assignment.

Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: reorder iunlink remove operation in xfs_ifree
Dave Chinner [Wed, 20 Apr 2022 22:45:16 +0000 (08:45 +1000)]
xfs: reorder iunlink remove operation in xfs_ifree

The O_TMPFILE creation implementation creates a specific order of
operations for inode allocation/freeing and unlinked list
modification. Currently both are serialised by the AGI, so the order
doesn't strictly matter as long as the are both in the same
transaction.

However, if we want to move the unlinked list insertions largely out
from under the AGI lock, then we have to be concerned about the
order in which we do unlinked list modification operations.
O_TMPFILE creation tells us this order is inode allocation/free,
then unlinked list modification.

Change xfs_ifree() to use this same ordering on unlinked list
removal. This way we always guarantee that when we enter the
iunlinked list removal code from this path, we already have the AGI
locked and we don't have to worry about lock nesting AGI reads
inside unlink list locks because it's already locked and attached to
the transaction.

We can do this safely as the inode freeing and unlinked list removal
are done in the same transaction and hence are atomic operations
with respect to log recovery.

Reported-by: Frank Hofmann <fhofmann@cloudflare.com>
Fixes: 298f7bec503f ("xfs: pin inode backing buffer to the inode log item")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoMAINTAINERS: update IOMAP FILESYSTEM LIBRARY and XFS FILESYSTEM
Tiezhu Yang [Wed, 20 Apr 2022 22:45:14 +0000 (08:45 +1000)]
MAINTAINERS: update IOMAP FILESYSTEM LIBRARY and XFS FILESYSTEM

In IOMAP FILESYSTEM LIBRARY and XFS FILESYSTEM, the M(ail): entry is
redundant with the L(ist): entry, remove the redundant M(ail): entry.

Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: convert buffer flags to unsigned.
Dave Chinner [Wed, 20 Apr 2022 22:44:59 +0000 (08:44 +1000)]
xfs: convert buffer flags to unsigned.

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned. This manifests as a compiler error such as:

/kisskb/src/fs/xfs/./xfs_trace.h:432:2: note: in expansion of macro 'TP_printk'
  TP_printk("dev %d:%d daddr 0x%llx bbcount 0x%x hold %d pincount %d "
  ^
/kisskb/src/fs/xfs/./xfs_trace.h:440:5: note: in expansion of macro '__print_flags'
     __print_flags(__entry->flags, "|", XFS_BUF_FLAGS),
     ^
/kisskb/src/fs/xfs/xfs_buf.h:67:4: note: in expansion of macro 'XBF_UNMAPPED'
  { XBF_UNMAPPED,  "UNMAPPED" }
    ^
/kisskb/src/fs/xfs/./xfs_trace.h:440:40: note: in expansion of macro 'XFS_BUF_FLAGS'
     __print_flags(__entry->flags, "|", XFS_BUF_FLAGS),
                                        ^
/kisskb/src/fs/xfs/./xfs_trace.h: In function 'trace_raw_output_xfs_buf_flags_class':
/kisskb/src/fs/xfs/xfs_buf.h:46:23: error: initializer element is not constant
 #define XBF_UNMAPPED  (1 << 31)/* do not map the buffer */

as __print_flags assigns XFS_BUF_FLAGS to a structure that uses an
unsigned long for the flag. Since this results in the value of
XBF_UNMAPPED causing a signed integer overflow, the result is
technically undefined behavior, which gcc-5 does not accept as an
integer constant.

This is based on a patch from Arnd Bergman <arnd@arndb.de>.

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: Add XFS_SB_FEAT_INCOMPAT_NREXT64 to the list of supported flags
Chandan Babu R [Wed, 11 Aug 2021 05:03:20 +0000 (10:33 +0530)]
xfs: Add XFS_SB_FEAT_INCOMPAT_NREXT64 to the list of supported flags

This commit enables XFS module to work with fs instances having 64-bit
per-inode extent counters by adding XFS_SB_FEAT_INCOMPAT_NREXT64 flag to the
list of supported incompat feature flags.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2 years agoxfs: Enable bulkstat ioctl to support 64-bit per-inode extent counters
Chandan Babu R [Wed, 9 Mar 2022 12:58:37 +0000 (12:58 +0000)]
xfs: Enable bulkstat ioctl to support 64-bit per-inode extent counters

The following changes are made to enable userspace to obtain 64-bit extent
counters,
1. Carve out a new 64-bit field xfs_bulkstat->bs_extents64 from
   xfs_bulkstat->bs_pad[] to hold 64-bit extent counter.
2. Define the new flag XFS_BULK_IREQ_BULKSTAT for userspace to indicate that
   it is capable of receiving 64-bit extent counters.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2 years agoxfs: Decouple XFS_IBULK flags from XFS_IWALK flags
Chandan Babu R [Wed, 9 Mar 2022 12:34:04 +0000 (12:34 +0000)]
xfs: Decouple XFS_IBULK flags from XFS_IWALK flags

A future commit will add a new XFS_IBULK flag which will not have a
corresponding XFS_IWALK flag. In preparation for the change, this commit
separates XFS_IBULK_* flags from XFS_IWALK_* flags.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2 years agoxfs: Conditionally upgrade existing inodes to use large extent counters
Chandan Babu R [Wed, 9 Mar 2022 07:49:36 +0000 (07:49 +0000)]
xfs: Conditionally upgrade existing inodes to use large extent counters

This commit enables upgrading existing inodes to use large extent counters
provided that underlying filesystem's superblock has large extent counter
feature enabled.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2 years agoxfs: Directory's data fork extent counter can never overflow
Chandan Babu R [Tue, 29 Mar 2022 06:14:00 +0000 (06:14 +0000)]
xfs: Directory's data fork extent counter can never overflow

The maximum file size that can be represented by the data fork extent counter
in the worst case occurs when all extents are 1 block in length and each block
is 1KB in size.

With XFS_MAX_EXTCNT_DATA_FORK_SMALL representing maximum extent count and with
1KB sized blocks, a file can reach upto,
(2^31) * 1KB = 2TB

This is much larger than the theoretical maximum size of a directory
i.e. XFS_DIR2_SPACE_SIZE * 3 = ~96GB.

Since a directory's inode can never overflow its data fork extent counter,
this commit removes all the overflow checks associated with
it. xfs_dinode_verify() now performs a rough check to verify if a diretory's
data fork is larger than 96GB.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2 years agoxfs: use a separate frextents counter for rt extent reservations
Darrick J. Wong [Mon, 11 Apr 2022 20:49:42 +0000 (06:49 +1000)]
xfs: use a separate frextents counter for rt extent reservations

As mentioned in the previous commit, the kernel misuses sb_frextents in
the incore mount to reflect both incore reservations made by running
transactions as well as the actual count of free rt extents on disk.
This results in the superblock being written to the log with an
underestimate of the number of rt extents that are marked free in the
rtbitmap.

Teaching XFS to recompute frextents after log recovery avoids
operational problems in the current mount, but it doesn't solve the
problem of us writing undercounted frextents which are then recovered by
an older kernel that doesn't have that fix.

Create an incore percpu counter to mirror the ondisk frextents.  This
new counter will track transaction reservations and the only time we
will touch the incore super counter (i.e the one that gets logged) is
when those transactions commit updates to the rt bitmap.  This is in
contrast to the lazysbcount counters (e.g. fdblocks), where we know that
log recovery will always fix any incorrect counter that we log.
As a bonus, we only take m_sb_lock at transaction commit time.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: recalculate free rt extents after log recovery
Darrick J. Wong [Mon, 11 Apr 2022 20:49:42 +0000 (06:49 +1000)]
xfs: recalculate free rt extents after log recovery

I've been observing periodic corruption reports from xfs_scrub involving
the free rt extent counter (frextents) while running xfs/141.  That test
uses an error injection knob to induce a torn write to the log, and an
arbitrary number of recovery mounts, frextents will count fewer free rt
extents than can be found the rtbitmap.

The root cause of the problem is a combination of the misuse of
sb_frextents in the incore mount to reflect both incore reservations
made by running transactions as well as the actual count of free rt
extents on disk.  The following sequence can reproduce the undercount:

Thread 1 Thread 2
xfs_trans_alloc(rtextents=3)
xfs_mod_frextents(-3)
<blocks>
xfs_attr_set()
xfs_bmap_attr_addfork()
xfs_add_attr2()
xfs_log_sb()
xfs_sb_to_disk()
xfs_trans_commit()
<log flushed to disk>
<log goes down>

Note that thread 1 subtracts 3 from sb_frextents even though it never
commits to using that space.  Thread 2 writes the undercounted value to
the ondisk superblock and logs it to the xattr transaction, which is
then flushed to disk.  At next mount, log recovery will find the logged
superblock and write that back into the filesystem.  At the end of log
recovery, we reread the superblock and install the recovered
undercounted frextents value into the incore superblock.  From that
point on, we've effectively leaked thread 1's transaction reservation.

The correct fix for this is to separate the incore reservation from the
ondisk usage, but that's a matter for the next patch.  Because the
kernel has been logging superblocks with undercounted frextents for a
very long time and we don't demand that sysadmins run xfs_repair after a
crash, fix the undercount by recomputing frextents after log recovery.

Gating this on log recovery is a reasonable balance (I think) between
correcting the problem and slowing down every mount attempt.  Note that
xfs_repair will fix undercounted frextents.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: pass explicit mount pointer to rtalloc query functions
Darrick J. Wong [Mon, 11 Apr 2022 20:49:41 +0000 (06:49 +1000)]
xfs: pass explicit mount pointer to rtalloc query functions

Pass an explicit xfs_mount pointer to the rtalloc query functions so
that they can support transactionless queries.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: Use generic_file_open()
Matthew Wilcox (Oracle) [Mon, 11 Apr 2022 20:49:40 +0000 (06:49 +1000)]
xfs: Use generic_file_open()

Remove the open-coded check of O_LARGEFILE.  This changes the errno
to be the same as other filesystems; it was changed generically in
2.6.24 but that fix skipped XFS.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: Introduce per-inode 64-bit extent counters
Chandan Babu R [Tue, 8 Mar 2022 09:34:28 +0000 (09:34 +0000)]
xfs: Introduce per-inode 64-bit extent counters

This commit introduces new fields in the on-disk inode format to support
64-bit data fork extent counters and 32-bit attribute fork extent
counters. The new fields will be used only when an inode has
XFS_DIFLAG2_NREXT64 flag set. Otherwise we continue to use the regular 32-bit
data fork extent counters and 16-bit attribute fork extent counters.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Suggested-by: Dave Chinner <dchinner@redhat.com>
2 years agoxfs: Replace numbered inode recovery error messages with descriptive ones
Chandan Babu R [Tue, 8 Mar 2022 09:19:46 +0000 (09:19 +0000)]
xfs: Replace numbered inode recovery error messages with descriptive ones

This commit also prints inode fields with invalid values instead of printing
addresses of inode and buffer instances.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Suggested-by: Dave Chinner <dchinner@redhat.com>
2 years agoxfs: Introduce macros to represent new maximum extent counts for data/attr forks
Chandan Babu R [Tue, 16 Nov 2021 09:54:37 +0000 (09:54 +0000)]
xfs: Introduce macros to represent new maximum extent counts for data/attr forks

This commit defines new macros to represent maximum extent counts allowed by
filesystems which have support for large per-inode extent counters.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2 years agoxfs: Use uint64_t to count maximum blocks that can be used by BMBT
Chandan Babu R [Tue, 16 Nov 2021 09:20:01 +0000 (09:20 +0000)]
xfs: Use uint64_t to count maximum blocks that can be used by BMBT

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2 years agoxfs: Introduce XFS_DIFLAG2_NREXT64 and associated helpers
Chandan Babu R [Tue, 16 Nov 2021 09:04:43 +0000 (09:04 +0000)]
xfs: Introduce XFS_DIFLAG2_NREXT64 and associated helpers

This commit adds the new per-inode flag XFS_DIFLAG2_NREXT64 to indicate that
an inode supports 64-bit extent counters. This flag is also enabled by default
on newly created inodes when the corresponding filesystem has large extent
counter feature bit (i.e. XFS_FEAT_NREXT64) set.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2 years agoxfs: Introduce XFS_FSOP_GEOM_FLAGS_NREXT64
Chandan Babu R [Tue, 16 Nov 2021 07:56:54 +0000 (07:56 +0000)]
xfs: Introduce XFS_FSOP_GEOM_FLAGS_NREXT64

XFS_FSOP_GEOM_FLAGS_NREXT64 indicates that the current filesystem instance
supports 64-bit per-inode extent counters.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2 years agoxfs: Introduce XFS_SB_FEAT_INCOMPAT_NREXT64 and associated per-fs feature bit
Chandan Babu R [Tue, 16 Nov 2021 08:39:32 +0000 (08:39 +0000)]
xfs: Introduce XFS_SB_FEAT_INCOMPAT_NREXT64 and associated per-fs feature bit

XFS_SB_FEAT_INCOMPAT_NREXT64 incompat feature bit will be set on filesystems
which support large per-inode extent counters. This commit defines the new
incompat feature bit and the corresponding per-fs feature bit (along with
inline functions to work on it).

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2 years agoxfs: Promote xfs_extnum_t and xfs_aextnum_t to 64 and 32-bits respectively
Chandan Babu R [Tue, 16 Nov 2021 07:28:40 +0000 (07:28 +0000)]
xfs: Promote xfs_extnum_t and xfs_aextnum_t to 64 and 32-bits respectively

A future commit will introduce a 64-bit on-disk data extent counter and a
32-bit on-disk attr extent counter. This commit promotes xfs_extnum_t and
xfs_aextnum_t to 64 and 32-bits in order to correctly handle in-core versions
of these quantities.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2 years agoxfs: Use basic types to define xfs_log_dinode's di_nextents and di_anextents
Chandan Babu R [Tue, 16 Nov 2021 07:20:39 +0000 (07:20 +0000)]
xfs: Use basic types to define xfs_log_dinode's di_nextents and di_anextents

A future commit will increase the width of xfs_extnum_t in order to facilitate
larger per-inode extent counters. Hence this patch now uses basic types to
define xfs_log_dinode->[di_nextents|dianextents].

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2 years agoxfs: Introduce xfs_dfork_nextents() helper
Chandan Babu R [Thu, 27 Aug 2020 10:04:34 +0000 (15:34 +0530)]
xfs: Introduce xfs_dfork_nextents() helper

This commit replaces the macro XFS_DFORK_NEXTENTS() with the helper function
xfs_dfork_nextents(). As of this commit, xfs_dfork_nextents() returns the same
value as XFS_DFORK_NEXTENTS(). A future commit which extends inode's extent
counter fields will add more logic to this helper.

This commit also replaces direct accesses to xfs_dinode->di_[a]nextents
with calls to xfs_dfork_nextents().

No functional changes have been made.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2 years agoxfs: Use xfs_extnum_t instead of basic data types
Chandan Babu R [Fri, 26 Feb 2021 05:54:31 +0000 (11:24 +0530)]
xfs: Use xfs_extnum_t instead of basic data types

xfs_extnum_t is the type to use to declare variables which have values
obtained from xfs_dinode->di_[a]nextents. This commit replaces basic
types (e.g. uint32_t) with xfs_extnum_t for such variables.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2 years agoxfs: Introduce xfs_iext_max_nextents() helper
Chandan Babu R [Thu, 27 Aug 2020 09:39:10 +0000 (15:09 +0530)]
xfs: Introduce xfs_iext_max_nextents() helper

xfs_iext_max_nextents() returns the maximum number of extents possible for one
of data, cow or attribute fork. This helper will be extended further in a
future commit when maximum extent counts associated with data/attribute forks
are increased.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2 years agoxfs: Define max extent length based on on-disk format definition
Chandan Babu R [Mon, 9 Aug 2021 06:35:22 +0000 (12:05 +0530)]
xfs: Define max extent length based on on-disk format definition

The maximum extent length depends on maximum block count that can be stored in
a BMBT record. Hence this commit defines MAXEXTLEN based on
BMBT_BLOCKCOUNT_BITLEN.

While at it, the commit also renames MAXEXTLEN to XFS_MAX_BMBT_EXTLEN.

Suggested-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2 years agoxfs: Move extent count limits to xfs_format.h
Chandan Babu R [Wed, 7 Oct 2020 09:30:03 +0000 (15:00 +0530)]
xfs: Move extent count limits to xfs_format.h

Maximum values associated with extent counters i.e. Maximum extent length,
Maximum data extents and Maximum xattr extents are dictated by the on-disk
format. Hence move these definitions over to xfs_format.h.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2 years agoxfs: Add XFS messages to printk index
Jonathan Lassoff [Mon, 11 Apr 2022 03:06:39 +0000 (13:06 +1000)]
xfs: Add XFS messages to printk index

In order for end users to quickly react to new issues that come up in
production, it is proving useful to leverage the printk indexing system.

This printk index enables kernel developers to use calls to printk()
with changeable format strings (as they always have; no change of
expectations), while enabling end users to examine format strings to
detect changes.
Since end users are using regular expressions to match messages printed
through printk(), being able to detect changes in chosen format strings
from release to release provides a useful signal to review
printk()-matching regular expressions for any necessary updates.

So that detailed XFS messages are captures by this printk index, this
patch wraps the xfs_<level> and xfs_alert_tag functions.

Signed-off-by: Jonathan Lassoff <jof@thejof.com>
Reviewed-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: Simplify XFS logging methods.
Jonathan Lassoff [Mon, 11 Apr 2022 03:06:28 +0000 (13:06 +1000)]
xfs: Simplify XFS logging methods.

Rather than have a constructor to define many nearly-identical
functions, use preprocessor macros to pass down a kernel logging level
to a common function.

Signed-off-by: Jonathan Lassoff <jof@thejof.com>
Reviewed-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoLinux 5.18-rc2
Linus Torvalds [Mon, 11 Apr 2022 00:21:36 +0000 (14:21 -1000)]
Linux 5.18-rc2

2 years agoMerge tag 'tty-5.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Linus Torvalds [Sun, 10 Apr 2022 20:08:50 +0000 (10:08 -1000)]
Merge tag 'tty-5.18-rc2' of git://git./linux/kernel/git/gregkh/tty

Pull serial driver fix from Greg KH:
 "This is a single serial driver fix for a build issue that showed up
  due to changes that came in through the tty tree in 5.18-rc1 that were
  missed previously. It resolves a build error with the mpc52xx_uart
  driver.

  It has been in linux-next this week with no reported problems"

* tag 'tty-5.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  tty: serial: mpc52xx_uart: make rx/tx hooks return unsigned, part II.

2 years agoMerge tag 'staging-5.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh...
Linus Torvalds [Sun, 10 Apr 2022 20:04:30 +0000 (10:04 -1000)]
Merge tag 'staging-5.18-rc2' of git://git./linux/kernel/git/gregkh/staging

Pull staging driver fix from Greg KH:
 "Here is a single staging driver fix for 5.18-rc2 that resolves an
  endian issue for the r8188eu driver. It has been in linux-next all
  this week with no reported problems"

* tag 'staging-5.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: r8188eu: Fix PPPoE tag insertion on little endian systems

2 years agoMerge tag 'driver-core-5.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 10 Apr 2022 19:55:09 +0000 (09:55 -1000)]
Merge tag 'driver-core-5.18-rc2' of git://git./linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here are two small driver core changes for 5.18-rc2.

  They are the final bits in the removal of the default_attrs field in
  struct kobj_type. I had to wait until after 5.18-rc1 for all of the
  changes to do this came in through different development trees, and
  then one new user snuck in. So this series has two changes:

   - removal of the default_attrs field in the powerpc/pseries/vas code.

     The change has been acked by the PPC maintainers to come through
     this tree

   - removal of default_attrs from struct kobj_type now that all
     in-kernel users are removed.

     This cleans up the kobject code a little bit and removes some
     duplicated functionality that confused people (now there is only
     one way to do default groups)

  Both of these have been in linux-next for all of this week with no
  reported problems"

* tag 'driver-core-5.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  kobject: kobj_type: remove default_attrs
  powerpc/pseries/vas: use default_groups in kobj_type

2 years agoMerge tag 'char-misc-5.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregk...
Linus Torvalds [Sun, 10 Apr 2022 19:52:46 +0000 (09:52 -1000)]
Merge tag 'char-misc-5.18-rc2' of git://git./linux/kernel/git/gregkh/char-misc

Pull char/misc driver fix from Greg KH:
 "A single driver fix. It resolves the build warning issue on 32bit
  systems in the habannalabs driver that came in during the 5.18-rc1
  merge cycle.

  It has been in linux-next for all this week with no reported problems"

* tag 'char-misc-5.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  habanalabs: Fix test build failures

2 years agoMerge tag 'powerpc-5.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc...
Linus Torvalds [Sun, 10 Apr 2022 17:36:18 +0000 (07:36 -1000)]
Merge tag 'powerpc-5.18-2' of git://git./linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

 - Fix KVM "lost kick" race, where an attempt to pull a vcpu out of the
   guest could be lost (or delayed until the next guest exit).

 - Disable SCV (system call vectored) when PR KVM guests could be run.

 - Fix KVM PR guests using SCV, by disallowing AIL != 0 for KVM PR
   guests.

 - Add a new KVM CAP to indicate if AIL == 3 is supported.

 - Fix a regression when hotplugging a CPU to a memoryless/cpuless node.

 - Make virt_addr_valid() stricter for 64-bit Book3E & 32-bit, which
   fixes crashes seen due to hardened usercopy.

 - Revert a change to max_mapnr which broke HIGHMEM.

Thanks to Christophe Leroy, Fabiano Rosas, Kefeng Wang, Nicholas Piggin,
and Srikar Dronamraju.

* tag 'powerpc-5.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  Revert "powerpc: Set max_mapnr correctly"
  powerpc: Fix virt_addr_valid() for 64-bit Book3E & 32-bit
  KVM: PPC: Move kvmhv_on_pseries() into kvm_ppc.h
  powerpc/numa: Handle partially initialized numa nodes
  powerpc/64: Fix build failure with allyesconfig in book3s_64_entry.S
  KVM: PPC: Use KVM_CAP_PPC_AIL_MODE_3
  KVM: PPC: Book3S PR: Disallow AIL != 0
  KVM: PPC: Book3S PR: Disable SCV when AIL could be disabled
  KVM: PPC: Book3S HV P9: Fix "lost kick" race

2 years agoMerge tag 'irq-urgent-2022-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 10 Apr 2022 17:25:49 +0000 (07:25 -1000)]
Merge tag 'irq-urgent-2022-04-10' of git://git./linux/kernel/git/tip/tip

Pull irq fixes from Thomas Gleixner:
 "A set of interrupt chip driver fixes:

   - A fix for a long standing bug in the ARM GICv3 redistributor
     polling which uses the wrong bit number to test.

   - Prevent translation of bogus ACPI table entries which map device
     interrupts into the IPI space on ARM GICs.

   - Don't write into the pending register of ARM GICV4 before the scan
     in hardware has completed.

   - A set of build and correctness fixes for the Qualcomm MPM driver"

* tag 'irq-urgent-2022-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/gic, gic-v3: Prevent GSI to SGI translations
  irqchip/gic-v3: Fix GICR_CTLR.RWP polling
  irqchip/gic-v4: Wait for GICR_VPENDBASER.Dirty to clear before descheduling
  irqchip/irq-qcom-mpm: fix return value check in qcom_mpm_init()
  irq/qcom-mpm: Fix build error without MAILBOX

2 years agoMerge tag 'x86_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 10 Apr 2022 17:12:27 +0000 (07:12 -1000)]
Merge tag 'x86_urgent_for_v5.18_rc2' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Fix the MSI message data struct definition

 - Use local labels in the exception table macros to avoid symbol
   conflicts with clang LTO builds

 - A couple of fixes to objtool checking of the relatively newly added
   SLS and IBT code

 - Rename a local var in the WARN* macro machinery to prevent shadowing

* tag 'x86_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/msi: Fix msi message data shadow struct
  x86/extable: Prefer local labels in .set directives
  x86,bpf: Avoid IBT objtool warning
  objtool: Fix SLS validation for kcov tail-call replacement
  objtool: Fix IBT tail-call detection
  x86/bug: Prevent shadowing in __WARN_FLAGS
  x86/mm/tlb: Revert retpoline avoidance approach

2 years agoMerge tag 'perf_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 10 Apr 2022 17:08:22 +0000 (07:08 -1000)]
Merge tag 'perf_urgent_for_v5.18_rc2' of git://git./linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:

 - A couple of fixes to cgroup-related handling of perf events

 - A couple of fixes to event encoding on Sapphire Rapids

 - Pass event caps of inherited events so that perf doesn't fail wrongly
   at fork()

 - Add support for a new Raptor Lake CPU

* tag 'perf_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Always set cpuctx cgrp when enable cgroup event
  perf/core: Fix perf_cgroup_switch()
  perf/core: Use perf_cgroup_info->active to check if cgroup is active
  perf/core: Don't pass task around when ctx sched in
  perf/x86/intel: Update the FRONTEND MSR mask on Sapphire Rapids
  perf/x86/intel: Don't extend the pseudo-encoding to GP counters
  perf/core: Inherit event_caps
  perf/x86/uncore: Add Raptor Lake uncore support
  perf/x86/msr: Add Raptor Lake CPU support
  perf/x86/cstate: Add Raptor Lake support
  perf/x86: Add Intel Raptor Lake support

2 years agoMerge tag 'locking_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 10 Apr 2022 16:56:46 +0000 (06:56 -1000)]
Merge tag 'locking_urgent_for_v5.18_rc2' of git://git./linux/kernel/git/tip/tip

Pull locking fixes from Borislav Petkov:

 - Allow the compiler to optimize away unused percpu accesses and change
   the local_lock_* macros back to inline functions

 - A couple of fixes to static call insn patching

* tag 'locking_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "mm/page_alloc: mark pagesets as __maybe_unused"
  Revert "locking/local_lock: Make the empty local_lock_*() function a macro."
  x86/percpu: Remove volatile from arch_raw_cpu_ptr().
  static_call: Remove __DEFINE_STATIC_CALL macro
  static_call: Properly initialise DEFINE_STATIC_CALL_RET0()
  static_call: Don't make __static_call_return0 static
  x86,static_call: Fix __static_call_return0 for i386

2 years agoMerge tag 'sched_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 10 Apr 2022 16:47:49 +0000 (06:47 -1000)]
Merge tag 'sched_urgent_for_v5.18_rc2' of git://git./linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

 - Use the correct static key checking primitive on the IRQ exit path

 - Two fixes for the new forceidle balancer

* tag 'sched_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  entry: Fix compile error in dynamic_irqentry_exit_cond_resched()
  sched: Teach the forced-newidle balancer about CPU affinity limitation.
  sched/core: Fix forceidle balancing

2 years agoMerge tag 'perf-tools-fixes-for-v5.18-2022-04-09' of git://git.kernel.org/pub/scm...
Linus Torvalds [Sun, 10 Apr 2022 04:45:10 +0000 (18:45 -1000)]
Merge tag 'perf-tools-fixes-for-v5.18-2022-04-09' of git://git./linux/kernel/git/acme/linux

Pull perf tools fixes from Arnaldo Carvalho de Melo:

 - Fix the clang command line option probing and remove some options to
   filter out, fixing the build with the latest clang versions

 - Fix 'perf bench' futex and epoll benchmarks to deal with machines
   with more than 1K CPUs

 - Fix 'perf test tsc' error message when not supported

 - Remap perf ring buffer if there is no space for event, fixing perf
   usage in 32-bit ChromeOS

 - Drop objdump stderr to avoid getting stuck waiting for stdout output
   in 'perf annotate'

 - Fix up garbled output by now showing unwind error messages when
   augmenting frame in best effort mode

 - Fix perf's libperf_print callback, use the va_args eprintf() variant

 - Sync vhost and arm64 cputype headers with the kernel sources

 - Fix 'perf report --mem-mode' with ARM SPE

 - Add missing external commands ('iiostat', etc) to 'perf --list-cmds'

* tag 'perf-tools-fixes-for-v5.18-2022-04-09' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
  perf annotate: Drop objdump stderr to avoid getting stuck waiting for stdout output
  perf tools: Add external commands to list-cmds
  perf docs: Add perf-iostat link to manpages
  perf session: Remap buf if there is no space for event
  perf bench: Fix epoll bench to correct usage of affinity for machines with #CPUs > 1K
  perf bench: Fix futex bench to correct usage of affinity for machines with #CPUs > 1K
  perf tools: Fix perf's libperf_print callback
  perf: arm-spe: Fix perf report --mem-mode
  perf unwind: Don't show unwind error messages when augmenting frame pointer stack
  tools headers arm64: Sync arm64's cputype.h with the kernel sources
  perf test tsc: Fix error message when not supported
  perf build: Don't use -ffat-lto-objects in the python feature test when building with clang-13
  perf python: Fix probing for some clang command line options
  tools build: Filter out options and warnings not supported by clang
  tools build: Use $(shell ) instead of `` to get embedded libperl's ccopts
  tools include UAPI: Sync linux/vhost.h with the kernel sources

2 years agoMerge tag 'cxl+nvdimm-for-5.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 10 Apr 2022 04:31:59 +0000 (18:31 -1000)]
Merge tag 'cxl+nvdimm-for-5.18-rc2' of git://git./linux/kernel/git/nvdimm/nvdimm

Pull cxl and nvdimm fixes from Dan Williams:

 - Fix a compile error in the nvdimm unit tests

 - Fix a shadowed variable warning in the CXL PCI driver

* tag 'cxl+nvdimm-for-5.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  cxl/pci: Drop shadowed variable
  tools/testing/nvdimm: Fix security_init() symbol collision

2 years agoMerge tag 'gpio-fixes-for-v5.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 10 Apr 2022 04:17:43 +0000 (18:17 -1000)]
Merge tag 'gpio-fixes-for-v5.18-rc2' of git://git./linux/kernel/git/brgl/linux

Pull gpio fix from Bartosz Golaszewski:

 - fix a race condition with consumers accessing the fields of GPIO IRQ
   chips before they're fully initialized

* tag 'gpio-fixes-for-v5.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
  gpio: Restrict usage of GPIO chip irq members before initialization

2 years agoMerge tag 'irqchip-fixes-5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git...
Thomas Gleixner [Sat, 9 Apr 2022 20:21:55 +0000 (22:21 +0200)]
Merge tag 'irqchip-fixes-5.18-1' of git://git./linux/kernel/git/maz/arm-platforms into irq/urgent

Pull irqchip fixes from Marc Zyngier:

 - Fix GICv3 polling for RWP in redistributors

 - Reject ACPI attempts to use SGIs on GIC/GICv3

 - Fix unpredictible behaviour when making a VPE non-resident
   with GICv4

 - A couple of fixes for the newly merged qcom-mpm driver

Link: https://lore.kernel.org/lkml/20220409094229.267649-1-maz@kernel.org
2 years agoperf annotate: Drop objdump stderr to avoid getting stuck waiting for stdout output
Ian Rogers [Thu, 7 Apr 2022 23:04:59 +0000 (16:04 -0700)]
perf annotate: Drop objdump stderr to avoid getting stuck waiting for stdout output

If objdump writes to stderr it can block waiting for it to be read. As
perf doesn't read stderr then progress stops with perf waiting for
stdout output.

Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Truong <alexandre.truong@arm.com>
Cc: Dave Marchevsky <davemarchevsky@fb.com>
Cc: Denis Nikitin <denik@chromium.org>
Cc: German Gomez <german.gomez@arm.com>
Cc: James Clark <james.clark@arm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Garry <john.garry@huawei.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Lexi Shao <shaolexi@huawei.com>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Martin Liška <mliska@suse.cz>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Michael Petlan <mpetlan@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Remi Bernon <rbernon@codeweavers.com>
Cc: Riccardo Mancini <rickyman7@gmail.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Richter <tmricht@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: William Cohen <wcohen@redhat.com>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lore.kernel.org/lkml/20220407230503.1265036-2-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2 years agoperf tools: Add external commands to list-cmds
Michael Petlan [Mon, 4 Apr 2022 22:15:41 +0000 (00:15 +0200)]
perf tools: Add external commands to list-cmds

The `perf --list-cmds` output prints only internal commands, although
there is no reason for that from users' perspective.

Adding the external commands to commands array with NULL function
pointer allows printing all perf commands while not changing the logic
of command handler selection.

Signed-off-by: Michael Petlan <mpetlan@redhat.com>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20220404221541.30312-2-mpetlan@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2 years agoperf docs: Add perf-iostat link to manpages
Michael Petlan [Mon, 4 Apr 2022 22:15:40 +0000 (00:15 +0200)]
perf docs: Add perf-iostat link to manpages

Signed-off-by: Michael Petlan <mpetlan@redhat.com>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20220404221541.30312-1-mpetlan@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2 years agoperf session: Remap buf if there is no space for event
Denis Nikitin [Wed, 30 Mar 2022 03:11:30 +0000 (20:11 -0700)]
perf session: Remap buf if there is no space for event

If a perf event doesn't fit into remaining buffer space return NULL to
remap buf and fetch the event again.

Keep the logic to error out on inadequate input from fuzzing.

This fixes perf failing on ChromeOS (with 32b userspace):

  $ perf report -v -i perf.data
  ...
  prefetch_event: head=0x1fffff8 event->header_size=0x30, mmap_size=0x2000000: fuzzed or compressed perf.data?
  Error:
  failed to process sample

Fixes: 57fc032ad643ffd0 ("perf session: Avoid infinite loop when seeing invalid header.size")
Reviewed-by: James Clark <james.clark@arm.com>
Signed-off-by: Denis Nikitin <denik@chromium.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20220330031130.2152327-1-denik@chromium.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2 years agoMerge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Linus Torvalds [Sat, 9 Apr 2022 16:05:46 +0000 (06:05 -1000)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:

 - add support for new devices (ufs, mvsas)

 - a major set of fixes in lpfc

 - get rid of a driver specific ioctl in pcmraid

 - a major rework of aha152x to get rid of the scsi_pointer.

 - minor fixes and obvious changes including several spelling updates.

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (36 commits)
  scsi: megaraid_sas: Target with invalid LUN ID is deleted during scan
  scsi: ufs: ufshpb: Fix a NULL check on list iterator
  scsi: sd: Clean up gendisk if device_add_disk() failed
  scsi: message: fusion: Remove redundant variable dmp
  scsi: mvsas: Add PCI ID of RocketRaid 2640
  scsi: sd: sd_read_cpr() requires VPD pages
  scsi: mpt3sas: Fail reset operation if config request timed out
  scsi: sym53c500_cs: Stop using struct scsi_pointer
  scsi: ufs: ufs-pci: Add support for Intel MTL
  scsi: mpt3sas: Fix mpt3sas_check_same_4gb_region() kdoc comment
  scsi: scsi_debug: Fix sdebug_blk_mq_poll() in_use_bm bitmap use
  scsi: bnx2i: Fix spelling mistake "mis-match" -> "mismatch"
  scsi: bnx2fc: Fix spelling mistake "mis-match" -> "mismatch"
  scsi: zorro7xx: Fix a resource leak in zorro7xx_remove_one()
  scsi: aic7xxx: Use standard PCI subsystem, subdevice defines
  scsi: ufs: qcom: Drop custom Android boot parameters
  scsi: core: sysfs: Remove comments that conflict with the actual logic
  scsi: hisi_sas: Remove stray fallthrough annotation
  scsi: virtio-scsi: Eliminate anonymous module_init & module_exit
  scsi: isci: Fix spelling mistake "doesnt" -> "doesn't"
  ...

2 years agoperf bench: Fix epoll bench to correct usage of affinity for machines with #CPUs...
Athira Rajeev [Wed, 6 Apr 2022 17:51:11 +0000 (23:21 +0530)]
perf bench: Fix epoll bench to correct usage of affinity for machines with #CPUs > 1K

The 'perf bench epoll' testcase fails on systems with more than 1K CPUs.

Testcase: perf bench epoll all

Result snippet:
<<>>
Run summary [PID 106497]: 1399 threads monitoring on 64 file-descriptors for 8 secs.

perf: pthread_create: No such file or directory
<<>>

In epoll benchmarks (ctl, wait) pthread_create is invoked in do_threads
from respective bench_epoll_*  function. Though the logs shows direct
failure from pthread_create, the actual failure is from
"sched_setaffinity" returning EINVAL (invalid argument).

This happens because the default mask size in glibc is 1024. To overcome
this 1024 CPUs mask size limitation of cpu_set_t, change the mask size
using the CPU_*_S macros.

Patch addresses this by fixing all the epoll benchmarks to use CPU_ALLOC
to allocate cpumask, CPU_ALLOC_SIZE for size, and CPU_SET_S to set the
mask.

Reported-by: Disha Goel <disgoel@linux.vnet.ibm.com>
Signed-off-by: Athira Jajeev <atrajeev@linux.vnet.ibm.com>
Tested-by: Disha Goel <disgoel@linux.vnet.ibm.com>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nageswara R Sastry <rnsastry@linux.ibm.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: https://lore.kernel.org/r/20220406175113.87881-3-atrajeev@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2 years agoperf bench: Fix futex bench to correct usage of affinity for machines with #CPUs...
Athira Rajeev [Wed, 6 Apr 2022 17:51:10 +0000 (23:21 +0530)]
perf bench: Fix futex bench to correct usage of affinity for machines with #CPUs > 1K

The 'perf bench futex' testcase fails on systems with more than 1K CPUs.

Testcase: perf bench futex all

Failure snippet:
<<>>Running futex/hash benchmark...

perf: pthread_create: No such file or directory
<<>>

All the futex benchmarks (ie hash, lock-api, requeue, wake,
wake-parallel), pthread_create is invoked in respective bench_futex_*
function. Though the logs shows direct failure from pthread_create,
strace logs showed that actual failure is from  "sched_setaffinity"
returning EINVAL (invalid argument).

This happens because the default mask size in glibc is 1024. To overcome
this 1024 CPUs mask size limitation of cpu_set_t, change the mask size
using the CPU_*_S macros.

Patch addresses this by fixing all the futex benchmarks to use CPU_ALLOC
to allocate cpumask, CPU_ALLOC_SIZE for size, and CPU_SET_S to set the
mask.

Reported-by: Disha Goel <disgoel@linux.vnet.ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Athira Jajeev <atrajeev@linux.vnet.ibm.com>
Tested-by: Disha Goel <disgoel@linux.vnet.ibm.com>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nageswara R Sastry <rnsastry@linux.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: https://lore.kernel.org/r/20220406175113.87881-2-atrajeev@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2 years agoperf tools: Fix perf's libperf_print callback
Adrian Hunter [Fri, 8 Apr 2022 13:26:25 +0000 (16:26 +0300)]
perf tools: Fix perf's libperf_print callback

eprintf() does not expect va_list as the type of the 4th parameter.

Use veprintf() because it does.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: 428dab813a56ce94 ("libperf: Merge libperf_set_print() into libperf_init()")
Cc: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20220408132625.2451452-1-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2 years agoperf: arm-spe: Fix perf report --mem-mode
James Clark [Fri, 8 Apr 2022 14:40:56 +0000 (15:40 +0100)]
perf: arm-spe: Fix perf report --mem-mode

Since commit bb30acae4c4dacfa ("perf report: Bail out --mem-mode if mem
info is not available") "perf mem report" and "perf report --mem-mode"
don't allow opening the file unless one of the events has
PERF_SAMPLE_DATA_SRC set.

SPE doesn't have this set even though synthetic memory data is generated
after it is decoded. Fix this issue by setting DATA_SRC on SPE events.
This has no effect on the data collected because the SPE driver doesn't
do anything with that flag and doesn't generate samples.

Fixes: bb30acae4c4dacfa ("perf report: Bail out --mem-mode if mem info is not available")
Signed-off-by: James Clark <james.clark@arm.com>
Tested-by: Leo Yan <leo.yan@linaro.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: German Gomez <german.gomez@arm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Garry <john.garry@huawei.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20220408144056.1955535-1-james.clark@arm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2 years agoperf unwind: Don't show unwind error messages when augmenting frame pointer stack
James Clark [Wed, 6 Apr 2022 14:56:51 +0000 (15:56 +0100)]
perf unwind: Don't show unwind error messages when augmenting frame pointer stack

Commit Fixes: b9f6fbb3b2c29736 ("perf arm64: Inject missing frames when
using 'perf record --call-graph=fp'") intended to add a 'best effort'
DWARF unwind that improved the frame pointer stack in most scenarios.

It's expected that the unwind will fail sometimes, but this shouldn't be
reported as an error. It only works when the return address can be
determined from the contents of the link register alone.

Fix the error shown when the unwinder requires extra registers by adding
a new flag that suppresses error messages. This flag is not set in the
normal --call-graph=dwarf unwind mode so that behavior is not changed.

Fixes: b9f6fbb3b2c29736 ("perf arm64: Inject missing frames when using 'perf record --call-graph=fp'")
Reported-by: John Garry <john.garry@huawei.com>
Signed-off-by: James Clark <james.clark@arm.com>
Tested-by: John Garry <john.garry@huawei.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Truong <alexandre.truong@arm.com>
Cc: German Gomez <german.gomez@arm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20220406145651.1392529-1-james.clark@arm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2 years agotools headers arm64: Sync arm64's cputype.h with the kernel sources
Arnaldo Carvalho de Melo [Sat, 9 Apr 2022 14:48:15 +0000 (11:48 -0300)]
tools headers arm64: Sync arm64's cputype.h with the kernel sources

To get the changes in:

  83bea32ac7ed37bb ("arm64: Add part number for Arm Cortex-A78AE")

That addresses this perf build warning:

  Warning: Kernel ABI header at 'tools/arch/arm64/include/asm/cputype.h' differs from latest version at 'arch/arm64/include/asm/cputype.h'
  diff -u tools/arch/arm64/include/asm/cputype.h arch/arm64/include/asm/cputype.h

Cc: Ali Saidi <alisaidi@amazon.com>
Cc: Andrew Kilroy <andrew.kilroy@arm.com>
Cc: Chanho Park <chanho61.park@samsung.com>
Cc: German Gomez <german.gomez@arm.com>
Cc: James Clark <james.clark@arm.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Will Deacon <will@kernel.org>
Link: http://lore.kernel.org/lkml/
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2 years agoperf test tsc: Fix error message when not supported
Chengdong Li [Fri, 8 Apr 2022 08:47:48 +0000 (16:47 +0800)]
perf test tsc: Fix error message when not supported

By default `perf test tsc` does not return the error message when the
child process detected kernel does not support it. Instead, the child
process prints an error message to stderr, unfortunately stderr is
redirected to /dev/null when verbose <= 0.

This patch does:

- return TEST_SKIP to the parent process instead of TEST_OK when
  perf_read_tsc_conversion() is not supported.

- Add a new subtest of testing if TSC is supported on current
  architecture by moving exist code to a separate function.
  It avoids two places in test__perf_time_to_tsc() that return
  TEST_SKIP by doing this.

- Extend the test suite definition to contain above two subtests.
  Current test_suite and test_case structs do not support printing skip
  reason when the number of subtest less than 1. To print skip reason, it
  is necessary to extend current test suite definition.

Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Chengdong Li <chengdongli@tencent.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: likexu@tencent.com
Link: https://lore.kernel.org/r/20220408084748.43707-1-chengdongli@tencent.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>