Dave Chinner [Sun, 12 Feb 2023 22:14:56 +0000 (09:14 +1100)]
xfs: return a referenced perag from filestreams allocator
Now that the filestreams AG selection tracks active perags, we need
to return an active perag to the core allocator code. This is
because the file allocation the filestreams code will run are AG
specific allocations and so need to pin the AG until the allocations
complete.
We cannot rely on the filestreams item reference to do this - the
filestreams association can be torn down at any time, hence we
need to have a separate reference for the allocation process to pin
the AG after it has been selected.
This means there is some perag juggling in allocation failure
fallback paths as they will do all AG scans in the case the AG
specific allocation fails. Hence we need to track the perag
reference that the filestream allocator returned to make sure we
don't leak it on repeated allocation failure.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:56 +0000 (09:14 +1100)]
xfs: pass perag to filestreams tracing
Pass perags instead of raw ag numbers, avoiding the need for the
special peek function for the tracing code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:55 +0000 (09:14 +1100)]
xfs: use for_each_perag_wrap in xfs_filestream_pick_ag
xfs_filestream_pick_ag() is now ready to rework to use
for_each_perag_wrap() for iterating the perags during the AG
selection scan.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:55 +0000 (09:14 +1100)]
xfs: track an active perag reference in filestreams
Rather than just track the agno of the reference, track a referenced
perag pointer instead. This will allow active filestreams to prevent
AGs from going away until the filestreams have been torn down.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:55 +0000 (09:14 +1100)]
xfs: factor out MRU hit case in xfs_filestream_select_ag
Because it now stands out like a sore thumb. Factoring out this case
starts the process of simplifying xfs_filestream_select_ag() again.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:55 +0000 (09:14 +1100)]
xfs: remove xfs_filestream_select_ag() longest extent check
Picking a new AG checks the longest free extent in the AG is valid,
so there's no need to repeat the check in
xfs_filestream_select_ag(). Remove it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:55 +0000 (09:14 +1100)]
xfs: merge new filestream AG selection into xfs_filestream_select_ag()
This is largely a wrapper around xfs_filestream_pick_ag() that
repeats a lot of the lookups that we just merged back into
xfs_filestream_select_ag() from the lookup code. Merge the
xfs_filestream_new_ag() code back into _select_ag() to get rid
of all the unnecessary logic.
Indeed, this makes it obvious that if we have no parent inode,
the filestreams allocator always selects AG 0 regardless of whether
it is fit for purpose or not.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:55 +0000 (09:14 +1100)]
xfs: merge filestream AG lookup into xfs_filestream_select_ag()
The lookup currently either returns the cached filestream AG or it
calls xfs_filestreams_select_lengths() to looks up a new AG. This
has verify the AG that is selected, so we end up doing "select a new
AG loop in a couple of places when only one really is needed. Merge
the initial lookup functionality with the length selection so that
we only need to do a single pick loop on lookup or verification
failure.
This undoes a lot of the factoring that enabled the selection to be
moved over to the filestreams code. It makes
xfs_filestream_select_ag() an awful messier, but it has to be made
worse before it can get better in future patches...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:55 +0000 (09:14 +1100)]
xfs: move xfs_bmap_btalloc_filestreams() to xfs_filestreams.c
xfs_bmap_btalloc_filestreams() calls two filestreams functions to
select the AG to allocate from. Both those functions end up in
the same selection function that iterates all AGs multiple times.
Worst case, xfs_bmap_btalloc_filestreams() can iterate all AGs 4
times just to select the initial AG to allocate in.
Move the AG selection to fs/xfs/xfs_filestreams.c as a single
interface so that the inefficient AG interation is contained
entirely within the filestreams code. This will allow the
implementation to be simplified and made more efficient in future
patches.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:55 +0000 (09:14 +1100)]
xfs: use xfs_bmap_longest_free_extent() in filestreams
The code in xfs_bmap_longest_free_extent() is open coded in
xfs_filestream_pick_ag(). Export xfs_bmap_longest_free_extent and
call it from the filestreams code instead.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:55 +0000 (09:14 +1100)]
xfs: get rid of notinit from xfs_bmap_longest_free_extent
It is only set if reading the AGF gets a EAGAIN error. Just return
the EAGAIN error and handle that error in the callers.
This means we can remove the not_init parameter from
xfs_bmap_select_minlen(), too, because the use of not_init there is
pessimistic. If we can't read the agf, it won't increase blen.
The only time we actually care whether we checked all the AGFs for
contiguous free space is when the best length is less than the
minimum allocation length. If not_init is set, then we ignore blen
and set the minimum alloc length to the absolute minimum, not the
best length we know already is present.
However, if blen is less than the minimum we're going to ignore it
anyway, regardless of whether we scanned all the AGFs or not. Hence
not_init can go away, because we only use if blen is good from
the scanned AGs otherwise we ignore it altogether and use minlen.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:54 +0000 (09:14 +1100)]
xfs: factor out filestreams from xfs_bmap_btalloc_nullfb
There's many if (filestreams) {} else {} branches in this function.
Split it out into a filestreams specific function so that we can
then work directly on cleaning up the filestreams code without
impacting the rest of the allocation algorithms.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:54 +0000 (09:14 +1100)]
xfs: convert trim to use for_each_perag_range
To convert it to using active perag references and hence make it
shrink safe.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:54 +0000 (09:14 +1100)]
xfs: convert xfs_alloc_vextent_iterate_ags() to use perag walker
Now that the AG iteration code in the core allocation code has been
cleaned up, we can easily convert it to use a for_each_perag..()
variant to use active references and skip AGs that it can't get
active references on.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:54 +0000 (09:14 +1100)]
xfs: move the minimum agno checks into xfs_alloc_vextent_check_args
All of the allocation functions now extract the minimum allowed AG
from the transaction and then use it in some way. The allocation
functions that are restricted to a single AG all check if the
AG requested can be allocated from and return an error if so. These
all set args->agno appropriately.
All the allocation functions that iterate AGs use it to calculate
the scan start AG. args->agno is not set until the iterator starts
walking AGs.
Hence we can easily set up a conditional check against the minimum
AG allowed in xfs_alloc_vextent_check_args() based on whether
args->agno contains NULLAGNUMBER or not and move all the repeated
setup code to xfs_alloc_vextent_check_args(), further simplifying
the allocation functions.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:54 +0000 (09:14 +1100)]
xfs: fold xfs_alloc_ag_vextent() into callers
We don't need the multiplexing xfs_alloc_ag_vextent() provided
anymore - we can just call the exact/near/size variants directly.
This allows us to remove args->type completely and stop using
args->fsbno as an input to the allocator algorithms.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:54 +0000 (09:14 +1100)]
xfs: move allocation accounting to xfs_alloc_vextent_set_fsbno()
Move it from xfs_alloc_ag_vextent() so we can get rid of that layer.
Rename xfs_alloc_vextent_set_fsbno() to xfs_alloc_vextent_finish()
to indicate that it's function is finishing off the allocation that
we've run now that it contains much more functionality.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:54 +0000 (09:14 +1100)]
xfs: introduce xfs_alloc_vextent_prepare()
Now that we have wrapper functions for each type of allocation we
can ask for, we can start unravelling xfs_alloc_ag_vextent(). That
is essentially just a prepare stage, the allocation multiplexer
and a post-allocation accounting step is the allocation proceeded.
The current xfs_alloc_vextent*() wrappers all have a prepare stage,
the allocation operation and a post-allocation accounting step.
We can consolidate this by moving the AG alloc prep code into the
wrapper functions, the accounting code in the wrapper accounting
functions, and cut out the multiplexer layer entirely.
This patch consolidates the AG preparation stage.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:54 +0000 (09:14 +1100)]
xfs: introduce xfs_alloc_vextent_exact_bno()
Two of the callers to xfs_alloc_vextent_this_ag() actually want
exact block number allocation, not anywhere-in-ag allocation. Split
this out from _this_ag() as a first class citizen so no external
extent allocation code needs to care about args->type anymore.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:54 +0000 (09:14 +1100)]
xfs: introduce xfs_alloc_vextent_near_bno()
The remaining callers of xfs_alloc_vextent() are all doing NEAR_BNO
allocations. We can replace that function with a new
xfs_alloc_vextent_near_bno() function that does this explicitly.
We also multiplex NEAR_BNO allocations through
xfs_alloc_vextent_this_ag via args->type. Replace all of these with
direct calls to xfs_alloc_vextent_near_bno(), too.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:53 +0000 (09:14 +1100)]
xfs: use xfs_alloc_vextent_start_bno() where appropriate
Change obvious callers of single AG allocation to use
xfs_alloc_vextent_start_bno(). Callers no long need to specify
XFS_ALLOCTYPE_START_BNO, and so the type can be driven inward and
removed.
While doing this, also pass the allocation target fsb as a parameter
rather than encoding it in args->fsbno.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:53 +0000 (09:14 +1100)]
xfs: use xfs_alloc_vextent_first_ag() where appropriate
Change obvious callers of single AG allocation to use
xfs_alloc_vextent_first_ag(). This gets rid of
XFS_ALLOCTYPE_FIRST_AG as the type used within
xfs_alloc_vextent_first_ag() during iteration is _THIS_AG. Hence we
can remove the setting of args->type from all the callers of
_first_ag() and remove the alloctype.
While doing this, pass the allocation target fsb as a parameter
rather than encoding it in args->fsbno. This starts the process
of making args->fsbno an output only variable rather than
input/output.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:53 +0000 (09:14 +1100)]
xfs: factor xfs_bmap_btalloc()
There are several different contexts xfs_bmap_btalloc() handles, and
large chunks of the code execute independent allocation contexts.
Try to untangle this mess a bit.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:53 +0000 (09:14 +1100)]
xfs: use xfs_alloc_vextent_this_ag() where appropriate
Change obvious callers of single AG allocation to use
xfs_alloc_vextent_this_ag(). Drive the per-ag grabbing out to the
callers, too, so that callers with active references don't need
to do new lookups just for an allocation in a context that already
has a perag reference.
The only remaining caller that does single AG allocation through
xfs_alloc_vextent() is xfs_bmap_btalloc() with
XFS_ALLOCTYPE_NEAR_BNO. That is going to need more untangling before
it can be converted cleanly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:53 +0000 (09:14 +1100)]
xfs: combine __xfs_alloc_vextent_this_ag and xfs_alloc_ag_vextent
There's a bit of a recursive conundrum around
xfs_alloc_ag_vextent(). We can't first call xfs_alloc_ag_vextent()
without preparing the AGFL for the allocation, and preparing the
AGFL calls xfs_alloc_ag_vextent() to prepare the AGFL for the
allocation. This "double allocation" requirement is not really clear
from the current xfs_alloc_fix_freelist() calls that are sprinkled
through the allocation code.
It's not helped that xfs_alloc_ag_vextent() can actually allocate
from the AGFL itself, but there's special code to prevent AGFL prep
allocations from allocating from the free list it's trying to prep.
The naming is also not consistent: args->wasfromfl is true when we
allocated _from_ the free list, but the indication that we are
allocating _for_ the free list is via checking that (args->resv ==
XFS_AG_RESV_AGFL).
So, lets make this "allocation required for allocation" situation
clear by moving it all inside xfs_alloc_ag_vextent(). The freelist
allocation is a specific XFS_ALLOCTYPE_THIS_AG allocation, which
translated directly to xfs_alloc_ag_vextent_size() allocation.
This enables us to replace __xfs_alloc_vextent_this_ag() with a call
to xfs_alloc_ag_vextent(), and we drive the freelist fixing further
into the per-ag allocation algorithm.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:53 +0000 (09:14 +1100)]
xfs: factor xfs_alloc_vextent_this_ag() for _iterate_ags()
The core of the per-ag iteration is effectively doing a "this ag"
allocation on one AG at a time. Use the same code to implement the
core "this ag" allocation in both xfs_alloc_vextent_this_ag()
and xfs_alloc_vextent_iterate_ags().
This means we only call xfs_alloc_ag_vextent() from one place so we
can easily collapse the call stack in future patches.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:53 +0000 (09:14 +1100)]
xfs: rework xfs_alloc_vextent()
It's a multiplexing mess that can be greatly simplified, and really
needs to be simplified to allow active per-ag references to
propagate from initial AG selection code the the bmapi code.
This splits the code out into separate a parameter checking
function, an iterator function, and allocation completion functions
and then implements the individual policies using these functions.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:53 +0000 (09:14 +1100)]
xfs: introduce xfs_for_each_perag_wrap()
In several places we iterate every AG from a specific start agno and
wrap back to the first AG when we reach the end of the filesystem to
continue searching. We don't have a primitive for this iteration
yet, so add one for conversion of these algorithms to per-ag based
iteration.
The filestream AG select code is a mess, and this initially makes it
worse. The per-ag selection needs to be driven completely into the
filestream code to clean this up and it will be done in a future
patch that makes the filestream allocator use active per-ag
references correctly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:52 +0000 (09:14 +1100)]
xfs: perags need atomic operational state
We currently don't have any flags or operational state in the
xfs_perag except for the pagf_init and pagi_init flags. And the
agflreset flag. Oh, there's also the pagf_metadata and pagi_inodeok
flags, too.
For controlling per-ag operations, we are going to need some atomic
state flags. Hence add an opstate field similar to what we already
have in the mount and log, and convert all these state flags across
to atomic bit operations.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:52 +0000 (09:14 +1100)]
xfs: convert xfs_ialloc_next_ag() to an atomic
This is currently a spinlock lock protected rotor which can be
implemented with a single atomic operation. Change it to be more
efficient and get rid of the m_agirotor_lock. Noticed while
converting the inode allocation AG selection loop to active perag
references.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:52 +0000 (09:14 +1100)]
xfs: inobt can use perags in many more places than it does
Lots of code in the inobt infrastructure is passed both xfs_mount
and perags. We only need perags for the per-ag inode allocation
code, so reduce the duplication by passing only the perags as the
primary object.
This ends up reducing the code size by a bit:
text data bss dec hex filename
orig 1138878 323979 548 1463405 16546d (TOTALS)
patched 1138709 323979 548 1463236 1653c4 (TOTALS)
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:52 +0000 (09:14 +1100)]
xfs: use active perag references for inode allocation
Convert the inode allocation routines to use active perag references
or references held by callers rather than grab their own. Also drive
the perag further inwards to replace xfs_mounts when doing
operations on a specific AG.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:52 +0000 (09:14 +1100)]
xfs: convert xfs_imap() to take a perag
Callers have referenced perags but they don't pass it into
xfs_imap() so it takes it's own reference. Fix that so we can change
inode allocation over to using active references.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:52 +0000 (09:14 +1100)]
xfs: rework the perag trace points to be perag centric
So that they all output the same information in the traces to make
debugging refcount issues easier.
This means that all the lookup/drop functions no longer need to use
the full memory barrier atomic operations (atomic*_return()) so
will have less overhead when tracing is off. The set/clear tag
tracepoints no longer abuse the reference count to pass the tag -
the tag being cleared is obvious from the _RET_IP_ that is recorded
in the trace point.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Sun, 12 Feb 2023 22:14:42 +0000 (09:14 +1100)]
xfs: active perag reference counting
We need to be able to dynamically remove instantiated AGs from
memory safely, either for shrinking the filesystem or paging AG
state in and out of memory (e.g. supporting millions of AGs). This
means we need to be able to safely exclude operations from accessing
perags while dynamic removal is in progress.
To do this, introduce the concept of active and passive references.
Active references are required for high level operations that make
use of an AG for a given operation (e.g. allocation) and pin the
perag in memory for the duration of the operation that is operating
on the perag (e.g. transaction scope). This means we can fail to get
an active reference to an AG, hence callers of the new active
reference API must be able to handle lookup failure gracefully.
Passive references are used in low level code, where we might need
to access the perag structure for the purposes of completing high
level operations. For example, buffers need to use passive
references because:
- we need to be able to do metadata IO during operations like grow
and shrink transactions where high level active references to the
AG have already been blocked
- buffers need to pin the perag until they are reclaimed from
memory, something that high level code has no direct control over.
- unused cached buffers should not prevent a shrink from being
started.
Hence we have active references that will form exclusion barriers
for operations to be performed on an AG, and passive references that
will prevent reclaim of the perag until all objects with passive
references have been reclaimed themselves.
This patch introduce xfs_perag_grab()/xfs_perag_rele() as the API
for active AG reference functionality. We also need to convert the
for_each_perag*() iterators to use active references, which will
start the process of converting high level code over to using active
references. Conversion of non-iterator based code to active
references will be done in followup patches.
Note that the implementation using reference counting is really just
a development vehicle for the API to ensure we don't have any leaks
in the callers. Once we need to remove perag structures from memory
dyanmically, we will need a much more robust per-ag state transition
mechanism for preventing new references from being taken while we
wait for existing references to drain before removal from memory can
occur....
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Fri, 10 Feb 2023 17:12:06 +0000 (04:12 +1100)]
xfs: don't assert fail on transaction cancel with deferred ops
We can error out of an allocation transaction when updating BMBT
blocks when things go wrong. This can be a btree corruption, and
unexpected ENOSPC, etc. In these cases, we already have deferred ops
queued for the first allocation that has been done, and we just want
to cancel out the transaction and shut down the filesystem on error.
In fact, we do just that for production systems - the assert that we
can't have a transaction with defer ops attached unless we are
already shut down is bogus and gets in the way of debugging
whatever issue is actually causing the transaction to be cancelled.
Remove the assert because it is causing spurious test failures to
hang test machines.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Fri, 10 Feb 2023 17:11:06 +0000 (04:11 +1100)]
xfs: t_firstblock is tracking AGs not blocks
The tp->t_firstblock field is now raelly tracking the highest AG we
have locked, not the block number of the highest allocation we've
made. It's purpose is to prevent AGF locking deadlocks, so rename it
to "highest AG" and simplify the implementation to just track the
agno rather than a fsbno.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Fri, 10 Feb 2023 17:10:06 +0000 (04:10 +1100)]
xfs: drop firstblock constraints from allocation setup
Now that xfs_alloc_vextent() does all the AGF deadlock prevention
filtering for multiple allocations in a single transaction, we no
longer need the allocation setup code to care about what AGs we
might already have locked.
Hence we can remove all the "nullfb" conditional logic in places
like xfs_bmap_btalloc() and instead have them focus simply on
setting up locality constraints. If the allocation fails due to
AGF lock filtering in xfs_alloc_vextent, then we just fall back as
we normally do to more relaxed allocation constraints.
As a result, any allocation that allows AG scanning (i.e. not
confined to a single AG) and does not force a worst case full
filesystem scan will now be able to attempt allocation from AGs
lower than that defined by tp->t_firstblock. This is because
xfs_alloc_vextent() allows try-locking of the AGFs and hence enables
low space algorithms to at least -try- to get space from AGs lower
than the one that we have currently locked and allocated from. This
is a significant improvement in the low space allocation algorithm.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Fri, 10 Feb 2023 17:09:06 +0000 (04:09 +1100)]
xfs: block reservation too large for minleft allocation
When we enter xfs_bmbt_alloc_block() without having first allocated
a data extent (i.e. tp->t_firstblock == NULLFSBLOCK) because we
are doing something like unwritten extent conversion, the transaction
block reservation is used as the minleft value.
This works for operations like unwritten extent conversion, but it
assumes that the block reservation is only for a BMBT split. THis is
not always true, and sometimes results in larger than necessary
minleft values being set. We only actually need enough space for a
btree split, something we already handle correctly in
xfs_bmapi_write() via the xfs_bmapi_minleft() calculation.
We should use xfs_bmapi_minleft() in xfs_bmbt_alloc_block() to
calculate the number of blocks a BMBT split on this inode is going to
require, not use the transaction block reservation that contains the
maximum number of blocks this transaction may consume in it...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Fri, 10 Feb 2023 17:08:06 +0000 (04:08 +1100)]
xfs: prefer free inodes at ENOSPC over chunk allocation
When an XFS filesystem has free inodes in chunks already allocated
on disk, it will still allocate new inode chunks if the target AG
has no free inodes in it. Normally, this is a good idea as it
preserves locality of all the inodes in a given directory.
However, at ENOSPC this can lead to using the last few remaining
free filesystem blocks to allocate a new chunk when there are many,
many free inodes that could be allocated without consuming free
space. This results in speeding up the consumption of the last few
blocks and inode create operations then returning ENOSPC when there
free inodes available because we don't have enough block left in the
filesystem for directory creation reservations to proceed.
Hence when we are near ENOSPC, we should be attempting to preserve
the remaining blocks for directory block allocation rather than
using them for unnecessary inode chunk creation.
This particular behaviour is exposed by xfs/294, when it drives to
ENOSPC on empty file creation whilst there are still thousands of
free inodes available for allocation in other AGs in the filesystem.
Hence, when we are within 1% of ENOSPC, change the inode allocation
behaviour to prefer to use existing free inodes over allocating new
inode chunks, even though it results is poorer locality of the data
set. It is more important for the allocations to be space efficient
near ENOSPC than to have optimal locality for performance, so lets
modify the inode AG selection code to reflect that fact.
This allows generic/294 to not only pass with this allocator rework
patchset, but to increase the number of post-ENOSPC empty inode
allocations to from ~600 to ~9080 before we hit ENOSPC on the
directory create transaction reservation.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Dave Chinner [Fri, 10 Feb 2023 17:07:06 +0000 (04:07 +1100)]
xfs: fix low space alloc deadlock
I've recently encountered an ABBA deadlock with g/476. The upcoming
changes seem to make this much easier to hit, but the underlying
problem is a pre-existing one.
Essentially, if we select an AG for allocation, then lock the AGF
and then fail to allocate for some reason (e.g. minimum length
requirements cannot be satisfied), then we drop out of the
allocation with the AGF still locked.
The caller then modifies the allocation constraints - usually
loosening them up - and tries again. This can result in trying to
access AGFs that are lower than the AGF we already have locked from
the failed attempt. e.g. the failed attempt skipped several AGs
before failing, so we have locks an AG higher than the start AG.
Retrying the allocation from the start AG then causes us to violate
AGF lock ordering and this can lead to deadlocks.
The deadlock exists even if allocation succeeds - we can do a
followup allocations in the same transaction for BMBT blocks that
aren't guaranteed to be in the same AG as the original, and can move
into higher AGs. Hence we really need to move the tp->t_firstblock
tracking down into xfs_alloc_vextent() where it can be set when we
exit with a locked AG.
xfs_alloc_vextent() can also check there if the requested
allocation falls within the allow range of AGs set by
tp->t_firstblock. If we can't allocate within the range set, we have
to fail the allocation. If we are allowed to to non-blocking AGF
locking, we can ignore the AG locking order limitations as we can
use try-locks for the first iteration over requested AG range.
This invalidates a set of post allocation asserts that check that
the allocation is always above tp->t_firstblock if it is set.
Because we can use try-locks to avoid the deadlock in some
circumstances, having a pre-existing locked AGF doesn't always
prevent allocation from lower order AGFs. Hence those ASSERTs need
to be removed.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Fri, 10 Feb 2023 17:06:06 +0000 (09:06 -0800)]
xfs: revert commit
8954c44ff477
The name passed into __xfs_xattr_put_listent is exactly namelen bytes
long and not null-terminated. Passing namelen+1 to the strscpy function
strscpy(offset, (char *)name, namelen + 1);
is therefore wrong. Go back to the old code, which works fine because
strncpy won't find a null in @name and stops after namelen bytes. It
really could be a memcpy call, but it worked for years.
Reported-by: syzbot+898115bc6d7140437215@syzkaller.appspotmail.com
Fixes:
8954c44ff477 ("xfs: use strscpy() to instead of strncpy()")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Thomas Weißschuh [Fri, 10 Feb 2023 02:56:48 +0000 (18:56 -0800)]
xfs: make kobj_type structures constant
Since commit
ee6d3dd4ed48 ("driver core: make kobj_type constant.")
the driver core allows the usage of const struct kobj_type.
Take advantage of this to constify the structure definitions to prevent
modification at runtime.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Donald Douwsma [Wed, 8 Feb 2023 16:59:56 +0000 (08:59 -0800)]
xfs: allow setting full range of panic tags
xfs will not allow combining other panic masks with
XFS_PTAG_VERIFIER_ERROR.
# sysctl fs.xfs.panic_mask=511
sysctl: setting key "fs.xfs.panic_mask": Invalid argument
fs.xfs.panic_mask = 511
Update to the maximum value that can be set to allow the full range of
masks. Do this using a mask of possible values to prevent this happening
again as suggested by Darrick.
Fixes:
d519da41e2b7 ("xfs: Introduce XFS_PTAG_VERIFIER_ERROR panic mask")
Signed-off-by: Donald Douwsma <ddouwsma@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Dave Chinner [Sun, 5 Feb 2023 16:48:24 +0000 (08:48 -0800)]
xfs: don't use BMBT btree split workers for IO completion
When we split a BMBT due to record insertion, we offload it to a
worker thread because we can be deep in the stack when we try to
allocate a new block for the BMBT. Allocation can use several
kilobytes of stack (full memory reclaim, swap and/or IO path can
end up on the stack during allocation) and we can already be several
kilobytes deep in the stack when we need to split the BMBT.
A recent workload demonstrated a deadlock in this BMBT split
offload. It requires several things to happen at once:
1. two inodes need a BMBT split at the same time, one must be
unwritten extent conversion from IO completion, the other must be
from extent allocation.
2. there must be a no available xfs_alloc_wq worker threads
available in the worker pool.
3. There must be sustained severe memory shortages such that new
kworker threads cannot be allocated to the xfs_alloc_wq pool for
both threads that need split work to be run
4. The split work from the unwritten extent conversion must run
first.
5. when the BMBT block allocation runs from the split work, it must
loop over all AGs and not be able to either trylock an AGF
successfully, or each AGF is is able to lock has no space available
for a single block allocation.
6. The BMBT allocation must then attempt to lock the AGF that the
second task queued to the rescuer thread already has locked before
it finds an AGF it can allocate from.
At this point, we have an ABBA deadlock between tasks queued on the
xfs_alloc_wq rescuer thread and a locked AGF. i.e. The queued task
holding the AGF lock can't be run by the rescuer thread until the
task the rescuer thread is runing gets the AGF lock....
This is a highly improbably series of events, but there it is.
There's a couple of ways to fix this, but the easiest way to ensure
that we only punt tasks with a locked AGF that holds enough space
for the BMBT block allocations to the worker thread.
This works for unwritten extent conversion in IO completion (which
doesn't have a locked AGF and space reservations) because we have
tight control over the IO completion stack. It is typically only 6
functions deep when xfs_btree_split() is called because we've
already offloaded the IO completion work to a worker thread and
hence we don't need to worry about stack overruns here.
The other place we can be called for a BMBT split without a
preceeding allocation is __xfs_bunmapi() when punching out the
center of an existing extent. We don't remove extents in the IO
path, so these operations don't tend to be called with a lot of
stack consumed. Hence we don't really need to ship the split off to
a worker thread in these cases, either.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 1 Feb 2023 18:16:04 +0000 (10:16 -0800)]
xfs: fix confusing variable names in xfs_refcount_item.c
Variable names in this code module are inconsistent and confusing.
xfs_phys_extent describe physical mappings, so rename them "pmap".
xfs_refcount_intents describe refcount intents, so rename them "ri".
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 1 Feb 2023 18:16:04 +0000 (10:16 -0800)]
xfs: pass refcount intent directly through the log intent code
Pass the incore refcount intent through the CUI logging code instead of
repeatedly boxing and unboxing parameters.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 1 Feb 2023 18:16:03 +0000 (10:16 -0800)]
xfs: fix confusing variable names in xfs_rmap_item.c
Variable names in this code module are inconsistent and confusing.
xfs_map_extent describe file mappings, so rename them "map".
xfs_rmap_intents describe block mapping intents, so rename them "ri".
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 1 Feb 2023 18:16:03 +0000 (10:16 -0800)]
xfs: pass rmap space mapping directly through the log intent code
Pass the incore rmap space mapping through the RUI logging code instead
of repeatedly boxing and unboxing parameters.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 1 Feb 2023 18:16:02 +0000 (10:16 -0800)]
xfs: fix confusing xfs_extent_item variable names
Change the name of all pointers to xfs_extent_item structures to "xefi"
to make the name consistent and because the current selections ("new"
and "free") mean other things in C.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 1 Feb 2023 18:16:02 +0000 (10:16 -0800)]
xfs: pass xfs_extent_free_item directly through the log intent code
Pass the incore xfs_extent_free_item through the EFI logging code
instead of repeatedly boxing and unboxing parameters.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 1 Feb 2023 18:16:01 +0000 (10:16 -0800)]
xfs: fix confusing variable names in xfs_bmap_item.c
Variable names in this code module are inconsistent and confusing.
xfs_map_extent describe file mappings, so rename them "map".
xfs_bmap_intents describe block mapping intents, so rename them "bi".
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 1 Feb 2023 17:50:53 +0000 (09:50 -0800)]
xfs: pass the xfs_bmbt_irec directly through the log intent code
Instead of repeatedly boxing and unboxing the incore extent mapping
structure as it passes through the BUI code, pass the pointer directly
through.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Xu Panda [Wed, 1 Feb 2023 17:31:34 +0000 (09:31 -0800)]
xfs: use strscpy() to instead of strncpy()
The implementation of strscpy() is more robust and safer.
That's now the recommended way to copy NUL-terminated strings.
Signed-off-by: Xu Panda <xu.panda@zte.com.cn>
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Linus Torvalds [Sun, 29 Jan 2023 21:59:43 +0000 (13:59 -0800)]
Linux 6.2-rc6
Linus Torvalds [Sun, 29 Jan 2023 19:26:49 +0000 (11:26 -0800)]
Merge tag 'irq_urgent_for_v6.2_rc6' of git://git./linux/kernel/git/tip/tip
Pull irq fix from Borislav Petkov:
- Cleanup the firmware node for the new IRQ MSI domain properly, to
avoid leaking memory
* tag 'irq_urgent_for_v6.2_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/msi: Free the fwnode created by msi_create_device_irq_domain()
Linus Torvalds [Sun, 29 Jan 2023 19:17:34 +0000 (11:17 -0800)]
Merge tag 'x86_urgent_for_v6.2_rc6' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Start checking for -mindirect-branch-cs-prefix clang support too now
that LLVM 16 will support it
- Fix a NULL ptr deref when suspending with Xen PV
- Have a SEV-SNP guest check explicitly for features enabled by the
hypervisor and fail gracefully if some are unsupported by the guest
instead of failing in a non-obvious and hard-to-debug way
- Fix a MSI descriptor leakage under Xen
- Mark Xen's MSI domain as supporting MSI-X
- Prevent legacy PIC interrupts from being resent in software by
marking them level triggered, as they should be, which lead to a NULL
ptr deref
* tag 'x86_urgent_for_v6.2_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/build: Move '-mindirect-branch-cs-prefix' out of GCC-only block
acpi: Fix suspend with Xen PV
x86/sev: Add SEV-SNP guest feature negotiation support
x86/pci/xen: Fixup fallout from the PCI/MSI overhaul
x86/pci/xen: Set MSI_FLAG_PCI_MSIX support in Xen MSI domain
x86/i8259: Mark legacy PIC interrupts with IRQ_LEVEL
Linus Torvalds [Sun, 29 Jan 2023 19:06:47 +0000 (11:06 -0800)]
Merge tag 'input-for-v6.2-rc5' of git://git./linux/kernel/git/dtor/input
Pull input fixes from Dmitry Torokhov:
- touchpads on HP 15-* laptops switched back to PS/2 emulation mode
- a quirk for Clevo PCX0DX/TUXEDO XP1511 to make sure keyboard is
responding after resume
* tag 'input-for-v6.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: i8042 - add Clevo PCX0DX to i8042 quirk table
Revert "Input: synaptics - switch touchpad on HP Laptop 15-da3001TU to RMI mode"
Linus Torvalds [Sun, 29 Jan 2023 18:47:22 +0000 (10:47 -0800)]
Merge tag 'cxl-fixes-for-6.2-rc6' of git://git./linux/kernel/git/cxl/cxl
Pull cxl fixes from Dan Williams:
"A couple of fixes for bugs introduced during the merge window. One is
a regression, the other was a bug in the CXL AER handler:
- Fix a crash regression due to module load order of cxl_pmem.ko
- Fix wrong register offset read in CXL AER handling path"
* tag 'cxl-fixes-for-6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl/pmem: Fix nvdimm unregistration when cxl_pmem driver is absent
cxl: fix cxl_report_and_clear() RAS UE addr mis-assignment
Vlastimil Babka [Fri, 13 Jan 2023 17:33:45 +0000 (18:33 +0100)]
Revert "mm/compaction: fix set skip in fast_find_migrateblock"
This reverts commit
7efc3b7261030da79001c00d92bc3392fd6c664c.
We have got openSUSE reports (Link 1) for 6.1 kernel with khugepaged
stalling CPU for long periods of time. Investigation of tracepoint data
shows that compaction is stuck in repeating fast_find_migrateblock()
based migrate page isolation, and then fails to migrate all isolated
pages.
Commit
7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock")
was suspected as it was merged in 6.1 and in theory can indeed remove a
termination condition for fast_find_migrateblock() under certain
conditions, as it removes a place that always marks a scanned pageblock
from being re-scanned. There are other such places, but those can be
skipped under certain conditions, which seems to match the tracepoint
data.
Testing of revert also appears to have resolved the issue, thus revert
the commit until a more robust solution for the original problem is
developed.
It's also likely this will fix qemu stalls with 6.1 kernel reported in
Link 2, but that is not yet confirmed.
Link: https://bugzilla.suse.com/show_bug.cgi?id=1206848
Link: https://lore.kernel.org/kvm/b8017e09-f336-3035-8344-c549086c2340@kernel.org/
Link: https://lore.kernel.org/lkml/20230125134434.18017-1-mgorman@techsingularity.net/
Fixes:
7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock")
Cc: <stable@vger.kernel.org>
Tested-by: Pedro Falcato <pedro.falcato@gmail.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 28 Jan 2023 19:17:57 +0000 (11:17 -0800)]
Fix up more non-executable files marked executable
Joe found another DT file that shouldn't be executable, and that
frustrated me enough that I went hunting with this script:
git ls-files -s |
grep '^100755' |
cut -f2 |
xargs grep -L '^#!'
and that found another file that shouldn't have been marked executable
either, despite being in the scripts directory.
Maybe these two are the last ones at least for now. But I'm sure we'll
be back in a few years, fixing things up again.
Fixes:
8c6789f4e2d4 ("ASoC: dt-bindings: Add Everest ES8326 audio CODEC")
Fixes:
4d8e5cd233db ("locking/atomics: Fix scripts/atomic/ script permissions")
Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 28 Jan 2023 18:52:51 +0000 (10:52 -0800)]
Merge tag '6.2-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull ksmbd server fixes from Steve French:
"Four smb3 server fixes, all also for stable:
- fix for signing bug
- fix to more strictly check packet length
- add a max connections parm to limit simultaneous connections
- fix error message flood that can occur with newer Samba xattr
format"
* tag '6.2-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: downgrade ndr version error message to debug
ksmbd: limit pdu length size according to connection status
ksmbd: do not sign response to session request for guest login
ksmbd: add max connections parameter
Linus Torvalds [Sat, 28 Jan 2023 01:41:47 +0000 (17:41 -0800)]
Merge tag '6.2-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fix from Steve French:
"Fix for reconnect oops in smbdirect (RDMA), also is marked for stable"
* tag '6.2-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Fix oops due to uncleared server->smbd_conn in reconnect
Linus Torvalds [Sat, 28 Jan 2023 00:16:57 +0000 (16:16 -0800)]
Merge tag 'block-6.2-2023-01-27' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
"Minor tweaks for this release:
- NVMe pull request via Christoph:
- Flush initial scan_work for async probe (Keith Busch)
- Fix passthrough csi check (Keith Busch)
- Fix nvme-fc initialization order (Ross Lagerwall)
- Fix for tearing down non-started device in ublk (Ming)"
* tag 'block-6.2-2023-01-27' of git://git.kernel.dk/linux:
block: ublk: move ublk_chr_class destroying after devices are removed
nvme: fix passthrough csi check
nvme-pci: flush initial scan_work for async probe
nvme-fc: fix initialization order
Linus Torvalds [Sat, 28 Jan 2023 00:15:06 +0000 (16:15 -0800)]
Merge tag 'io_uring-6.2-2023-01-27' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
"Two small fixes for this release:
- Sanitize how async prep is done for drain requests, so we ensure
that it always gets done (Dylan)
- A ring provided buffer recycling fix for multishot receive (me)"
* tag 'io_uring-6.2-2023-01-27' of git://git.kernel.dk/linux:
io_uring: always prep_async for drain requests
io_uring/net: cache provided buffer group value for multishot receives
Linus Torvalds [Sat, 28 Jan 2023 00:09:12 +0000 (16:09 -0800)]
Merge tag 'hardening-v6.2-rc6' of git://git./linux/kernel/git/kees/linux
Pull hardening fixes from Kees Cook:
- Split slow memcpy tests into MEMCPY_SLOW_KUNIT_TEST
- Reorganize gcc-plugin includes for GCC 13
- Silence bcache memcpy run-time false positive warnings
* tag 'hardening-v6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
bcache: Silence memcpy() run-time false positive warnings
gcc-plugins: Reorganize gimple includes for GCC 13
kunit: memcpy: Split slow memcpy tests into MEMCPY_SLOW_KUNIT_TEST
Linus Torvalds [Sat, 28 Jan 2023 00:03:32 +0000 (16:03 -0800)]
Merge tag 'trace-v6.2-rc5' of git://git./linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix filter memory leak by calling ftrace_free_filter()
- Initialize trace_printk() earlier so that ftrace_dump_on_oops shows
data on early crashes.
- Update the outdated instructions in scripts/tracing/ftrace-bisect.sh
- Add lockdep_is_held() to fix lockdep warning
- Add allocation failure check in create_hist_field()
- Don't initialize pointer that gets set right away in enabled_monitors_write()
- Update MAINTAINER entries
- Fix help messages in Kconfigs
- Fix kernel-doc header for update_preds()
* tag 'trace-v6.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
bootconfig: Update MAINTAINERS file to add tree and mailing list
rv: remove redundant initialization of pointer ptr
ftrace: Maintain samples/ftrace
tracing/filter: fix kernel-doc warnings
lib: Kconfig: fix spellos
trace_events_hist: add check for return value of 'create_hist_field'
tracing/osnoise: Use built-in RCU list checking
tracing: Kconfig: Fix spelling/grammar/punctuation
ftrace/scripts: Update the instructions for ftrace-bisect.sh
tracing: Make sure trace_printk() can output as soon as it can be used
ftrace: Export ftrace_free_filter() to modules
Linus Torvalds [Fri, 27 Jan 2023 21:52:38 +0000 (13:52 -0800)]
Merge tag 'i2c-for-6.2-rc6' of git://git./linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"A bunch of driver fixes with a tiny bit of new IDs"
* tag 'i2c-for-6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: rk3x: fix a bunch of kernel-doc warnings
i2c: axxia: use 'struct' for kernel-doc notation
dt-bindings: i2c: renesas,rzv2m: Fix SoC specific string
i2c: mxs: suppress probe-deferral error message
i2c: designware-pci: Add new PCI IDs for AMD NAVI GPU
i2c: designware: Fix unbalanced suspended flag
i2c: designware: use casting of u64 in clock multiplication to avoid overflow
Linus Torvalds [Fri, 27 Jan 2023 21:47:40 +0000 (13:47 -0800)]
Merge tag 'gpio-fixes-for-v6.2-rc6' of git://git./linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- fix the -c option in the gpio-event-mode user-space example program
- fix the irq number translation in gpio-ep93xx and make its irqchip
immutable
- add a missing spin_unlock in error path in gpio-mxc
- fix a suspend breakage on System76 and Lenovo Gen2a introduced in
GPIO ACPI
* tag 'gpio-fixes-for-v6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
tools: gpio: fix -c option of gpio-event-mon
gpio: ep93xx: remove unused variable
gpio: ep93xx: Make irqchip immutable
gpio: ep93xx: Fix port F hwirq numbers in handler
gpio: mxc: Unlock on error path in mxc_flip_edge()
gpiolib-acpi: Don't set GPIOs for wakeup in S3 mode
Linus Torvalds [Fri, 27 Jan 2023 21:43:46 +0000 (13:43 -0800)]
Merge tag 'regulator-fix-v6.2-rc5' of git://git./linux/kernel/git/broonie/regulator
Pull regulator fix from Mark Brown:
"A fix for the DT binding documentation which dropped a property when
being converted to YAML format causing spurious errors validating
device trees for platforms using the device"
* tag 'regulator-fix-v6.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: dt-bindings: samsung,s2mps14: add lost samsung,ext-control-gpios
Linus Torvalds [Fri, 27 Jan 2023 21:39:30 +0000 (13:39 -0800)]
Merge tag 'ovl-fixes-6.2-rc6' of git://git./linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
"Fix two bugs, a recent one introduced in the last cycle, and an older
one from v5.11"
* tag 'ovl-fixes-6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fail on invalid uid/gid mapping at copy up
ovl: fix tmpfile leak
Linus Torvalds [Fri, 27 Jan 2023 21:18:14 +0000 (13:18 -0800)]
Merge tag 'drm-fixes-2023-01-27' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Fairly small this week as well, i915 has a memory leak fix and some
minor changes, and amdgpu has some MST fixes, and some other minor
ones:
drm:
- DP MST kref fix
- fb_helper: check return value
i915:
- Fix BSC default context for Meteor Lake
- Fix selftest-scheduler's modify_type
- memory leak fix
amdgpu:
- GC11.x fixes
- SMU13.0.0 fix
- Freesync video fix
- DP MST fixes
- build fix"
* tag 'drm-fixes-2023-01-27' of git://anongit.freedesktop.org/drm/drm:
amdgpu: fix build on non-DCN platforms.
drm/amd/display: Fix timing not changning when freesync video is enabled
drm/display/dp_mst: Correct the kref of port.
drm/amdgpu/display/mst: update mst_mgr relevant variable when long HPD
drm/amdgpu/display/mst: limit payload to be updated one by one
drm/amdgpu/display/mst: Fix mst_state->pbn_div and slot count assignments
drm/amdgpu: declare firmware for new MES 11.0.4
drm/amdgpu: enable imu firmware for GC 11.0.4
drm/amd/pm: add missing AllowIHInterrupt message mapping for SMU13.0.0
drm/amdgpu: remove unconditional trap enable on add gfx11 queues
drm/fb-helper: Use a per-driver FB deferred I/O handler
drm/fb-helper: Check fb_deferred_io_init() return value
drm/i915/selftest: fix intel_selftest_modify_policy argument types
drm/i915/mtl: Fix bcs default context
drm/i915: Fix a memory leak with reused mmap_offset
drm/drm_vma_manager: Add drm_vma_node_allow_once()
Linus Torvalds [Fri, 27 Jan 2023 21:11:19 +0000 (13:11 -0800)]
Merge tag 'acpi-6.2-rc6' of git://git./linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"Add ACPI backlight handling quirks for 3 machines (Hans de Goede)"
* tag 'acpi-6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: video: Add backlight=native DMI quirk for Asus U46E
ACPI: video: Add backlight=native DMI quirk for HP EliteBook 8460p
ACPI: video: Add backlight=native DMI quirk for HP Pavilion g6-1d80nr
Linus Torvalds [Fri, 27 Jan 2023 21:01:36 +0000 (13:01 -0800)]
Merge tag 'thermal-6.2-rc6' of git://git./linux/kernel/git/rafael/linux-pm
Pull thermal control fixes from Rafael Wysocki:
"Add locking to the Intel int340x thermal control driver to prevent its
thermal zone callbacks from racing with firmware-induced thermal trip
point updates (Srinivas Pandruvada, Rafael Wysocki)"
* tag 'thermal-6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thermal: intel: int340x: Add locking to int340x_thermal_get_trip_type()
thermal: intel: int340x: Protect trip temperature from concurrent updates
Linus Torvalds [Fri, 27 Jan 2023 20:56:45 +0000 (12:56 -0800)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fix from Will Deacon:
- Fix event counting regression in Arm CMN PMU driver due to broken
optimisation
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
Partially revert "perf/arm-cmn: Optimise DTC counter accesses"
Linus Torvalds [Fri, 27 Jan 2023 20:52:45 +0000 (12:52 -0800)]
Merge tag 'riscv-for-linus-6.2-rc6' of git://git./linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- A few DT bindings fixes to more closely align the ISA string
requirements between the bindings and the ISA manual.
- A handful of build error/warning fixes.
- A fix to move init_cpu_topology() later in the boot flow, so it can
allocate memory.
- The IRC channel is now in the MAINTAINERS file, so it's easier to
find.
* tag 'riscv-for-linus-6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Move call to init_cpu_topology() to later initialization stage
riscv/kprobe: Fix instruction simulation of JALR
riscv: fix -Wundef warning for CONFIG_RISCV_BOOT_SPINWAIT
MAINTAINERS: add an IRC entry for RISC-V
RISC-V: fix compile error from deduplicated __ALTERNATIVE_CFG_2
dt-bindings: riscv: fix single letter canonical order
dt-bindings: riscv: fix underscore requirement for multi-letter extensions
Linus Torvalds [Fri, 27 Jan 2023 20:49:00 +0000 (12:49 -0800)]
Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ARM fixes from Russell King:
- fix nommu assignment build warning
- fix -Wundef preprocessor warning
- reduce __thumb2__ definitions for crypto files that require it
* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 9287/1: Reduce __thumb2__ definition to crypto files that require it
ARM: 9284/1: include <asm/pgtable.h> from proc-macros.S to fix -Wundef warnings
ARM: 9280/1: mm: fix warning on phys_addr_t to void pointer assignment
Linus Torvalds [Fri, 27 Jan 2023 20:41:09 +0000 (12:41 -0800)]
Merge tag 'linux-kselftest-fixes-6.2-rc6' of git://git./linux/kernel/git/shuah/linux-kselftest
Pull Kselftest fixes from Shuah Khan:
"A single fix to a amd-pstate test Makefile bug that deletes source
files during make clean run"
* tag 'linux-kselftest-fixes-6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests: amd-pstate: Don't delete source files via Makefile
Miklos Szeredi [Tue, 24 Jan 2023 15:41:18 +0000 (16:41 +0100)]
ovl: fail on invalid uid/gid mapping at copy up
If st_uid/st_gid doesn't have a mapping in the mounter's user_ns, then
copy-up should fail, just like it would fail if the mounter task was doing
the copy using "cp -a".
There's a corner case where the "cp -a" would succeed but copy up fail: if
there's a mapping of the invalid uid/gid (65534 by default) in the user
namespace. This is because stat(2) will return this value if the mapping
doesn't exist in the current user_ns and "cp -a" will in turn be able to
create a file with this uid/gid.
This behavior would be inconsistent with POSIX ACL's, which return -1 for
invalid uid/gid which result in a failed copy.
For consistency and simplicity fail the copy of the st_uid/st_gid are
invalid.
Fixes:
459c7c565ac3 ("ovl: unprivieged mounts")
Cc: <stable@vger.kernel.org> # v5.11
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Seth Forshee <sforshee@kernel.org>
Miklos Szeredi [Tue, 24 Jan 2023 15:41:18 +0000 (16:41 +0100)]
ovl: fix tmpfile leak
Missed an error cleanup.
Reported-by: syzbot+fd749a7ea127a84e0ffd@syzkaller.appspotmail.com
Fixes:
2b1a77461f16 ("ovl: use vfs_tmpfile_open() helper")
Cc: <stable@vger.kernel.org> # v6.1
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Dylan Yudaken [Fri, 27 Jan 2023 10:59:11 +0000 (02:59 -0800)]
io_uring: always prep_async for drain requests
Drain requests all go through io_drain_req, which has a quick exit in case
there is nothing pending (ie the drain is not useful). In that case it can
run the issue the request immediately.
However for safety it queues it through task work.
The problem is that in this case the request is run asynchronously, but
the async work has not been prepared through io_req_prep_async.
This has not been a problem up to now, as the task work always would run
before returning to userspace, and so the user would not have a chance to
race with it.
However - with IORING_SETUP_DEFER_TASKRUN - this is no longer the case and
the work might be defered, giving userspace a chance to change data being
referred to in the request.
Instead _always_ prep_async for drain requests, which is simpler anyway
and removes this issue.
Cc: stable@vger.kernel.org
Fixes:
c0e0d6ba25f1 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20230127105911.2420061-1-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ivo Borisov Shopov [Thu, 26 Jan 2023 13:10:33 +0000 (15:10 +0200)]
tools: gpio: fix -c option of gpio-event-mon
Following line should listen for a rising edge and exit after the first
one since '-c 1' is provided.
# gpio-event-mon -n gpiochip1 -o 0 -r -c 1
It works with kernel 4.19 but it doesn't work with 5.10. In 5.10 the
above command doesn't exit after the first rising edge it keep listening
for an event forever. The '-c 1' is not taken into an account.
The problem is in commit
62757c32d5db ("tools: gpio: add multi-line
monitoring to gpio-event-mon").
Before this commit the iterator 'i' in monitor_device() is used for
counting of the events (loops). In the case of the above command (-c 1)
we should start from 0 and increment 'i' only ones and hit the 'break'
statement and exit the process. But after the above commit counting
doesn't start from 0, it start from 1 when we listen on one line.
It is because 'i' is used from one more purpose, counting of lines
(num_lines) and it isn't restore to 0 after following code
for (i = 0; i < num_lines; i++)
gpiotools_set_bit(&values.mask, i);
Restore the initial value of the iterator to 0 in order to allow counting
of loops to work for any cases.
Fixes:
62757c32d5db ("tools: gpio: add multi-line monitoring to gpio-event-mon")
Signed-off-by: Ivo Borisov Shopov <ivoshopov@gmail.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
[Bartosz: tweak the commit message]
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Arnd Bergmann [Fri, 27 Jan 2023 09:35:05 +0000 (10:35 +0100)]
gpio: ep93xx: remove unused variable
This one was left behind by a previous cleanup patch:
drivers/gpio/gpio-ep93xx.c: In function 'ep93xx_gpio_add_bank':
drivers/gpio/gpio-ep93xx.c:366:34: error: unused variable 'ic' [-Werror=unused-variable]
Fixes:
216f37366e86 ("gpio: ep93xx: Make irqchip immutable")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Dave Airlie [Fri, 27 Jan 2023 02:31:02 +0000 (12:31 +1000)]
Merge tag 'drm-misc-fixes-2023-01-26' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
A fix and a preliminary patch to fix a memory leak in i915, and a use
after free fix for fbdev deferred io
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <maxime@cerno.tech>
Link: https://patchwork.freedesktop.org/patch/msgid/20230126104018.cbrcjxl5wefdbb2f@houat
Dave Airlie [Fri, 27 Jan 2023 02:15:13 +0000 (12:15 +1000)]
amdgpu: fix build on non-DCN platforms.
This fixes the build here locally on my 32-bit arm build.
Signed-off-by: Dave Airlie <airlied@redhat.com>
Dave Airlie [Fri, 27 Jan 2023 01:50:08 +0000 (11:50 +1000)]
Merge tag 'amd-drm-fixes-6.2-2023-01-25' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
amd-drm-fixes-6.2-2023-01-25:
amdgpu:
- GC11.x fixes
- SMU13.0.0 fix
- Freesync video fix
- DP MST fixes
drm:
- DP MST kref fix
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230125220153.320248-1-alexander.deucher@amd.com
Dave Airlie [Fri, 27 Jan 2023 01:39:55 +0000 (11:39 +1000)]
Merge tag 'drm-intel-fixes-2023-01-26' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
- Fix BSC default context for Meteor Lake (Lucas)
- Fix selftest-scheduler's modify_type (Andi)
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/Y9LKD2J5bmICTyIP@intel.com
Jens Axboe [Thu, 26 Jan 2023 18:43:33 +0000 (11:43 -0700)]
Merge tag 'nvme-6.2-2023-01-26' of git://git.infradead.org/nvme into block-6.2
Pull NVMe fixes from Christoph:
"nvme fixes for Linux 6.2
- flush initial scan_work for async probe (Keith Busch)
- fix passthrough csi check (Keith Busch)
- fix nvme-fc initialization order (Ross Lagerwall)"
* tag 'nvme-6.2-2023-01-26' of git://git.infradead.org/nvme:
nvme: fix passthrough csi check
nvme-pci: flush initial scan_work for async probe
nvme-fc: fix initialization order
Linus Torvalds [Thu, 26 Jan 2023 18:29:49 +0000 (10:29 -0800)]
Merge tag 'platform-drivers-x86-v6.2-3' of git://git./linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fixes from Hans de Goede:
- Fix false positive apple_gmux backlight detection on older iGPU only
MacBook models
- Various other small fixes and hardware-id additions
* tag 'platform-drivers-x86-v6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86: thinkpad_acpi: Fix profile modes on Intel platforms
ACPI: video: Fix apple gmux detection
platform/x86: apple-gmux: Add apple_gmux_detect() helper
platform/x86: apple-gmux: Move port defines to apple-gmux.h
platform/x86: hp-wmi: Fix cast to smaller integer type warning
platform/x86/amd: pmc: Add a module parameter to disable workarounds
platform/x86/amd: pmc: Disable IRQ1 wakeup for RN/CZN
platform/x86: asus-wmi: Fix kbd_dock_devid tablet-switch reporting
platform/x86: gigabyte-wmi: add support for B450M DS3H WIFI-CF
platform/x86: hp-wmi: Handle Omen Key event
platform/x86: dell-wmi: Add a keymap for KEY_MUTE in type 0x0010 table
Linus Torvalds [Thu, 26 Jan 2023 18:20:12 +0000 (10:20 -0800)]
Merge tag 'net-6.2-rc6' of git://git./linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter.
Current release - regressions:
- sched: sch_taprio: do not schedule in taprio_reset()
Previous releases - regressions:
- core: fix UaF in netns ops registration error path
- ipv4: prevent potential spectre v1 gadgets
- ipv6: fix reachability confirmation with proxy_ndp
- netfilter: fix for the set rbtree
- eth: fec: use page_pool_put_full_page when freeing rx buffers
- eth: iavf: fix temporary deadlock and failure to set MAC address
Previous releases - always broken:
- netlink: prevent potential spectre v1 gadgets
- netfilter: fixes for SCTP connection tracking
- mctp: struct sock lifetime fixes
- eth: ravb: fix possible hang if RIS2_QFF1 happen
- eth: tg3: resolve deadlock in tg3_reset_task() during EEH
Misc:
- Mat stepped out as MPTCP co-maintainer"
* tag 'net-6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (40 commits)
net: mdio-mux-meson-g12a: force internal PHY off on mux switch
docs: networking: Fix bridge documentation URL
tsnep: Fix TX queue stop/wake for multiple queues
net/tg3: resolve deadlock in tg3_reset_task() during EEH
net: mctp: mark socks as dead on unhash, prevent re-add
net: mctp: hold key reference when looking up a general key
net: mctp: move expiry timer delete to unhash
net: mctp: add an explicit reference from a mctp_sk_key to sock
net: ravb: Fix possible hang if RIS2_QFF1 happen
net: ravb: Fix lack of register setting after system resumed for Gen3
net/x25: Fix to not accept on connected socket
ice: move devlink port creation/deletion
sctp: fail if no bound addresses can be used for a given scope
net/sched: sch_taprio: do not schedule in taprio_reset()
Revert "Merge branch 'ethtool-mac-merge'"
netrom: Fix use-after-free of a listening socket.
netfilter: conntrack: unify established states for SCTP paths
Revert "netfilter: conntrack: add sctp DATA_SENT state"
netfilter: conntrack: fix bug in for_each_sctp_chunk
netfilter: conntrack: fix vtag checks for ABORT/SHUTDOWN_COMPLETE
...
Linus Torvalds [Thu, 26 Jan 2023 18:05:39 +0000 (10:05 -0800)]
treewide: fix up files incorrectly marked executable
I'm not exactly clear on what strange workflow causes people to do it,
but clearly occasionally some files end up being committed as executable
even though they clearly aren't.
This is a reprise of commit
90fda63fa115 ("treewide: fix up files
incorrectly marked executable"), just with a different set of files (but
with the same trivial shell scripting).
So apparently we need to re-do this every five years or so, and Joe
needs to just keep reminding me to do so ;)
Reported-by: Joe Perches <joe@perches.com>
Fixes:
523375c943e5 ("drm/vmwgfx: Port vmwgfx to arm64")
Fixes:
5c439937775d ("ASoC: codecs: add support for ES8326")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ming Lei [Thu, 26 Jan 2023 11:53:46 +0000 (19:53 +0800)]
block: ublk: move ublk_chr_class destroying after devices are removed
The 'ublk_chr_class' is needed when deleting ublk char devices in
ublk_exit(), so move it after devices(idle) are removed.
Fixes the following warning reported by Harris, James R:
[ 859.178950] sysfs group 'power' not found for kobject 'ublkc0'
[ 859.178962] WARNING: CPU: 3 PID: 1109 at fs/sysfs/group.c:278 sysfs_remove_group+0x9c/0xb0
Reported-by: "Harris, James R" <james.r.harris@intel.com>
Fixes:
71f28f3136af ("ublk_drv: add io_uring based userspace block driver")
Link: https://lore.kernel.org/linux-block/Y9JlFmSgDl3+zy3N@T590/T/#t
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Jim Harris <james.r.harris@intel.com>
Link: https://lore.kernel.org/r/20230126115346.263344-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Robin Murphy [Mon, 23 Jan 2023 18:30:38 +0000 (18:30 +0000)]
Partially revert "perf/arm-cmn: Optimise DTC counter accesses"
It turns out the optimisation implemented by commit
4f2c3872dde5 is
totally broken, since all the places that consume hw->dtcs_used for
events other than cycle count are still not expecting it to be sparsely
populated, and fail to read all the relevant DTC counters correctly if
so.
If implemented correctly, the optimisation potentially saves up to 3
register reads per event update, which is reasonably significant for
events targeting a single node, but still not worth a massive amount of
additional code complexity overall. Getting it right within the current
design looks a fair bit more involved than it was ever intended to be,
so let's just make a functional revert which restores the old behaviour
while still backporting easily.
Fixes:
4f2c3872dde5 ("perf/arm-cmn: Optimise DTC counter accesses")
Reported-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/b41bb4ed7283c3d8400ce5cf5e6ec94915e6750f.1674498637.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Jerome Brunet [Tue, 24 Jan 2023 10:11:57 +0000 (11:11 +0100)]
net: mdio-mux-meson-g12a: force internal PHY off on mux switch
Force the internal PHY off then on when switching to the internal path.
This fixes problems where the PHY ID is not properly set.
Fixes:
7090425104db ("net: phy: add amlogic g12a mdio mux support")
Suggested-by: Qi Duan <qi.duan@amlogic.com>
Co-developed-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Link: https://lore.kernel.org/r/20230124101157.232234-1-jbrunet@baylibre.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ivan Vecera [Tue, 24 Jan 2023 14:51:26 +0000 (15:51 +0100)]
docs: networking: Fix bridge documentation URL
Current documentation URL [1] is no longer valid.
[1] https://www.linuxfoundation.org/collaborate/workgroups/networking/bridge
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Link: https://lore.kernel.org/r/20230124145127.189221-1-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gerhard Engleder [Tue, 24 Jan 2023 19:14:40 +0000 (20:14 +0100)]
tsnep: Fix TX queue stop/wake for multiple queues
netif_stop_queue() and netif_wake_queue() act on TX queue 0. This is ok
as long as only a single TX queue is supported. But support for multiple
TX queues was introduced with
762031375d5c and I missed to adapt stop
and wake of TX queues.
Use netif_stop_subqueue() and netif_tx_wake_queue() to act on specific
TX queue.
Fixes:
762031375d5c ("tsnep: Support multiple TX/RX queue pairs")
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Link: https://lore.kernel.org/r/20230124191440.56887-1-gerhard@engleder-embedded.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Christensen [Tue, 24 Jan 2023 18:53:39 +0000 (13:53 -0500)]
net/tg3: resolve deadlock in tg3_reset_task() during EEH
During EEH error injection testing, a deadlock was encountered in the tg3
driver when tg3_io_error_detected() was attempting to cancel outstanding
reset tasks:
crash> foreach UN bt
...
PID: 159 TASK:
c0000000067c6000 CPU: 8 COMMAND: "eehd"
...
#5 [
c00000000681f990] __cancel_work_timer at
c00000000019fd18
#6 [
c00000000681fa30] tg3_io_error_detected at
c00800000295f098 [tg3]
#7 [
c00000000681faf0] eeh_report_error at
c00000000004e25c
...
PID: 290 TASK:
c000000036e5f800 CPU: 6 COMMAND: "kworker/6:1"
...
#4 [
c00000003721fbc0] rtnl_lock at
c000000000c940d8
#5 [
c00000003721fbe0] tg3_reset_task at
c008000002969358 [tg3]
#6 [
c00000003721fc60] process_one_work at
c00000000019e5c4
...
PID: 296 TASK:
c000000037a65800 CPU: 21 COMMAND: "kworker/21:1"
...
#4 [
c000000037247bc0] rtnl_lock at
c000000000c940d8
#5 [
c000000037247be0] tg3_reset_task at
c008000002969358 [tg3]
#6 [
c000000037247c60] process_one_work at
c00000000019e5c4
...
PID: 655 TASK:
c000000036f49000 CPU: 16 COMMAND: "kworker/16:2"
...:1
#4 [
c0000000373ebbc0] rtnl_lock at
c000000000c940d8
#5 [
c0000000373ebbe0] tg3_reset_task at
c008000002969358 [tg3]
#6 [
c0000000373ebc60] process_one_work at
c00000000019e5c4
...
Code inspection shows that both tg3_io_error_detected() and
tg3_reset_task() attempt to acquire the RTNL lock at the beginning of
their code blocks. If tg3_reset_task() should happen to execute between
the times when tg3_io_error_deteced() acquires the RTNL lock and
tg3_reset_task_cancel() is called, a deadlock will occur.
Moving tg3_reset_task_cancel() call earlier within the code block, prior
to acquiring RTNL, prevents this from happening, but also exposes another
deadlock issue where tg3_reset_task() may execute AFTER
tg3_io_error_detected() has executed:
crash> foreach UN bt
PID: 159 TASK:
c0000000067d2000 CPU: 9 COMMAND: "eehd"
...
#4 [
c000000006867a60] rtnl_lock at
c000000000c940d8
#5 [
c000000006867a80] tg3_io_slot_reset at
c0080000026c2ea8 [tg3]
#6 [
c000000006867b00] eeh_report_reset at
c00000000004de88
...
PID: 363 TASK:
c000000037564000 CPU: 6 COMMAND: "kworker/6:1"
...
#3 [
c000000036c1bb70] msleep at
c000000000259e6c
#4 [
c000000036c1bba0] napi_disable at
c000000000c6b848
#5 [
c000000036c1bbe0] tg3_reset_task at
c0080000026d942c [tg3]
#6 [
c000000036c1bc60] process_one_work at
c00000000019e5c4
...
This issue can be avoided by aborting tg3_reset_task() if EEH error
recovery is already in progress.
Fixes:
db84bf43ef23 ("tg3: tg3_reset_task() needs to use rtnl_lock to synchronize")
Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Link: https://lore.kernel.org/r/20230124185339.225806-1-drc@linux.vnet.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Namjae Jeon [Tue, 24 Jan 2023 15:09:02 +0000 (00:09 +0900)]
ksmbd: downgrade ndr version error message to debug
When user switch samba to ksmbd, The following message flood is coming
when accessing files. Samba seems to changs dos attribute version to v5.
This patch downgrade ndr version error message to debug.
$ dmesg
...
[68971.766914] ksmbd: v5 version is not supported
[68971.779808] ksmbd: v5 version is not supported
[68971.871544] ksmbd: v5 version is not supported
[68971.910135] ksmbd: v5 version is not supported
...
Cc: stable@vger.kernel.org
Fixes:
e2f34481b24d ("cifsd: add server-side procedures for SMB3")
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Namjae Jeon [Tue, 24 Jan 2023 15:13:20 +0000 (00:13 +0900)]
ksmbd: limit pdu length size according to connection status
Stream protocol length will never be larger than 16KB until session setup.
After session setup, the size of requests will not be larger than
16KB + SMB2 MAX WRITE size. This patch limits these invalidly oversized
requests and closes the connection immediately.
Fixes:
0626e6641f6b ("cifsd: add server handler for central processing and tranport layers")
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-18259
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Dan Williams [Sat, 21 Jan 2023 00:26:12 +0000 (16:26 -0800)]
cxl/pmem: Fix nvdimm unregistration when cxl_pmem driver is absent
The cxl_pmem.ko module houses the driver for both cxl_nvdimm_bridge
objects and cxl_nvdimm objects. When the core creates a cxl_nvdimm it
arranges for it to be autoremoved when the bridge goes down. However, if
the bridge never initialized because the cxl_pmem.ko module never
loaded, it sets up a the following crash scenario:
BUG: kernel NULL pointer dereference, address:
0000000000000478
[..]
RIP: 0010:cxl_nvdimm_probe+0x99/0x140 [cxl_pmem]
[..]
Call Trace:
<TASK>
cxl_bus_probe+0x17/0x50 [cxl_core]
really_probe+0xde/0x380
__driver_probe_device+0x78/0x170
driver_probe_device+0x1f/0x90
__driver_attach+0xd2/0x1c0
bus_for_each_dev+0x79/0xc0
bus_add_driver+0x1b1/0x200
driver_register+0x89/0xe0
cxl_pmem_init+0x50/0xff0 [cxl_pmem]
It turns out the recent rework to simplify nvdimm probing obviated the
need to unregister cxl_nvdimm objects at cxl_nvdimm_bridge ->remove()
time. Leave the cxl_nvdimm device registered until the hosting
cxl_memdev departs. The alternative is that the cxl_memdev needs to be
reattached whenever the cxl_nvdimm_bridge attach state cycles, which is
awkward and unnecessary.
The only requirement is to make sure that when the cxl_nvdimm_bridge
goes away any dependent cxl_nvdimm objects are shutdown. Handle that in
unregister_nvdimm_bus().
With these registration entanglements removed there is no longer a need
to pre-load the cxl_pmem module in cxl_acpi.
Fixes:
cb9cfff82f6a ("cxl/acpi: Simplify cxl_nvdimm_bridge probing")
Reported-by: Gregory Price <gregory.price@memverge.com>
Debugged-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://lore.kernel.org/r/167426077263.3955046.9695309346988027311.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>