platform/kernel/linux-rpi.git
4 years agoext4: fix partial cluster initialization when splitting extent
Jeffle Xu [Fri, 22 May 2020 04:18:44 +0000 (12:18 +0800)]
ext4: fix partial cluster initialization when splitting extent

Fix the bug when calculating the physical block number of the first
block in the split extent.

This bug will cause xfstests shared/298 failure on ext4 with bigalloc
enabled occasionally. Ext4 error messages indicate that previously freed
blocks are being freed again, and the following fsck will fail due to
the inconsistency of block bitmap and bg descriptor.

The following is an example case:

1. First, Initialize a ext4 filesystem with cluster size '16K', block size
'4K', in which case, one cluster contains four blocks.

2. Create one file (e.g., xxx.img) on this ext4 filesystem. Now the extent
tree of this file is like:

...
36864:[0]4:220160
36868:[0]14332:145408
51200:[0]2:231424
...

3. Then execute PUNCH_HOLE fallocate on this file. The hole range is
like:

..
ext4_ext_remove_space: dev 254,16 ino 12 since 49506 end 49506 depth 1
ext4_ext_remove_space: dev 254,16 ino 12 since 49544 end 49546 depth 1
ext4_ext_remove_space: dev 254,16 ino 12 since 49605 end 49607 depth 1
...

4. Then the extent tree of this file after punching is like

...
49507:[0]37:158047
49547:[0]58:158087
...

5. Detailed procedure of punching hole [49544, 49546]

5.1. The block address space:
```
lblk        ~49505  49506   49507~49543     49544~49546    49547~
  ---------+------+-------------+----------------+--------
    extent | hole |   extent | hole  | extent
  ---------+------+-------------+----------------+--------
pblk       ~158045  158046  158047~158083  158084~158086   158087~
```

5.2. The detailed layout of cluster 39521:
```
cluster 39521
<------------------------------->

hole   extent
<----------------------><--------

lblk      49544   49545   49546   49547
+-------+-------+-------+-------+
| | | | |
+-------+-------+-------+-------+
pblk     158084  1580845  158086  158087
```

5.3. The ftrace output when punching hole [49544, 49546]:
- ext4_ext_remove_space (start 49544, end 49546)
  - ext4_ext_rm_leaf (start 49544, end 49546, last_extent [49507(158047), 40], partial [pclu 39522 lblk 0 state 2])
    - ext4_remove_blocks (extent [49507(158047), 40], from 49544 to 49546, partial [pclu 39522 lblk 0 state 2]
      - ext4_free_blocks: (block 158084 count 4)
        - ext4_mballoc_free (extent 1/6753/1)

5.4. Ext4 error message in dmesg:
EXT4-fs error (device vdb): mb_free_blocks:1457: group 1, block 158084:freeing already freed block (bit 6753); block bitmap corrupt.
EXT4-fs error (device vdb): ext4_mb_generate_buddy:747: group 1, block bitmap and bg descriptor inconsistent: 19550 vs 19551 free clusters

In this case, the whole cluster 39521 is freed mistakenly when freeing
pblock 158084~158086 (i.e., the first three blocks of this cluster),
although pblock 158087 (the last remaining block of this cluster) has
not been freed yet.

The root cause of this isuue is that, the pclu of the partial cluster is
calculated mistakenly in ext4_ext_remove_space(). The correct
partial_cluster.pclu (i.e., the cluster number of the first block in the
next extent, that is, lblock 49597 (pblock 158086)) should be 39521 rather
than 39522.

Fixes: f4226d9ea400 ("ext4: fix partial cluster initialization")
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Eric Whitney <enwlinux@gmail.com>
Cc: stable@kernel.org # v3.19+
Link: https://lore.kernel.org/r/1590121124-37096-1-git-send-email-jefflexu@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: avoid race conditions when remounting with options that change dax
Theodore Ts'o [Wed, 10 Jun 2020 15:16:37 +0000 (11:16 -0400)]
ext4: avoid race conditions when remounting with options that change dax

Trying to change dax mount options when remounting could allow mount
options to be enabled for a small amount of time, and then the mount
option change would be reverted.

In the case of "mount -o remount,dax", this can cause a race where
files would temporarily treated as DAX --- and then not.

Cc: stable@kernel.org
Reported-by: syzbot+bca9799bf129256190da@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoEnable ext4 support for per-file/directory dax operations
Theodore Ts'o [Thu, 11 Jun 2020 14:51:44 +0000 (10:51 -0400)]
Enable ext4 support for per-file/directory dax operations

This adds the same per-file/per-directory DAX support for ext4 as was
done for xfs, now that we finally have consensus over what the
interface should be.

4 years agoext4: avoid unnecessary transaction starts during writeback
Jan Kara [Mon, 25 May 2020 08:12:15 +0000 (10:12 +0200)]
ext4: avoid unnecessary transaction starts during writeback

ext4_writepages() currently works in a loop like:
  start a transaction
  scan inode for pages to write
  map and submit these pages
  stop the transaction

This loop results in starting transaction once more than is needed
because in the last iteration we start a transaction only to scan the
inode and find there are no pages to write. This can be significant
increase in number of transaction starts for single-extent files or
files that have all blocks already mapped. Furthermore we already know
from previous iteration whether there are more pages to write or not. So
propagate the information from mpage_prepare_extent_to_map() and avoid
unnecessary looping in case there are no more pages to write.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200525081215.29451-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: don't block for O_DIRECT if IOCB_NOWAIT is set
Jens Axboe [Sun, 24 May 2020 22:53:16 +0000 (16:53 -0600)]
ext4: don't block for O_DIRECT if IOCB_NOWAIT is set

Running with some debug patches to detect illegal blocking triggered the
extend/unaligned condition in ext4. If ext4 needs to extend the file (and
hence go to buffered IO), or if the app is doing unaligned IO, then ext4
asks the iomap code to wait for IO completion. If the caller asked for
no-wait semantics by setting IOCB_NOWAIT, then ext4 should return -EAGAIN
instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/76152096-2bbb-7682-8fce-4cb498bcd909@kernel.dk
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: remove the access_ok() check in ext4_ioctl_get_es_cache
Christoph Hellwig [Sat, 23 May 2020 07:30:16 +0000 (09:30 +0200)]
ext4: remove the access_ok() check in ext4_ioctl_get_es_cache

access_ok just checks we are fed a proper user pointer.  We also do that
in copy_to_user itself, so no need to do this early.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200523073016.2944131-10-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agofs: remove the access_ok() check in ioctl_fiemap
Christoph Hellwig [Sat, 23 May 2020 07:30:15 +0000 (09:30 +0200)]
fs: remove the access_ok() check in ioctl_fiemap

access_ok just checks we are fed a proper user pointer.  We also do that
in copy_to_user itself, so no need to do this early.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-9-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agofs: handle FIEMAP_FLAG_SYNC in fiemap_prep
Christoph Hellwig [Sat, 23 May 2020 07:30:14 +0000 (09:30 +0200)]
fs: handle FIEMAP_FLAG_SYNC in fiemap_prep

By moving FIEMAP_FLAG_SYNC handling to fiemap_prep we ensure it is
handled once instead of duplicated, but can still be done under fs locks,
like xfs/iomap intended with its duplicate handling.  Also make sure the
error value of filemap_write_and_wait is propagated to user space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-8-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agofs: move fiemap range validation into the file systems instances
Christoph Hellwig [Sat, 23 May 2020 07:30:13 +0000 (09:30 +0200)]
fs: move fiemap range validation into the file systems instances

Replace fiemap_check_flags with a fiemap_prep helper that also takes the
inode and mapped range, and performs the sanity check and truncation
previously done in fiemap_check_range.  This way the validation is inside
the file system itself and thus properly works for the stacked overlayfs
case as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-7-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoiomap: fix the iomap_fiemap prototype
Christoph Hellwig [Sat, 23 May 2020 07:30:12 +0000 (09:30 +0200)]
iomap: fix the iomap_fiemap prototype

iomap_fiemap should take u64 start and len arguments, just like the
->fiemap prototype.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-6-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agofs: move the fiemap definitions out of fs.h
Christoph Hellwig [Sat, 23 May 2020 07:30:11 +0000 (09:30 +0200)]
fs: move the fiemap definitions out of fs.h

No need to pull the fiemap definitions into almost every file in the
kernel build.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-5-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agofs: mark __generic_block_fiemap static
Christoph Hellwig [Sat, 23 May 2020 07:30:10 +0000 (09:30 +0200)]
fs: mark __generic_block_fiemap static

There is no caller left outside of ioctl.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-4-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: remove the call to fiemap_check_flags in ext4_fiemap
Christoph Hellwig [Sat, 23 May 2020 07:30:09 +0000 (09:30 +0200)]
ext4: remove the call to fiemap_check_flags in ext4_fiemap

iomap_fiemap already calls fiemap_check_flags first thing, so this
additional check is redundant.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200523073016.2944131-3-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: split _ext4_fiemap
Christoph Hellwig [Sat, 23 May 2020 07:30:08 +0000 (09:30 +0200)]
ext4: split _ext4_fiemap

The fiemap and EXT4_IOC_GET_ES_CACHE cases share almost no code, so split
them into entirely separate functions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200523073016.2944131-2-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: fix fiemap size checks for bitmap files
Christoph Hellwig [Tue, 5 May 2020 15:43:15 +0000 (17:43 +0200)]
ext4: fix fiemap size checks for bitmap files

Add an extra validation of the len parameter, as for ext4 some files
might have smaller file size limits than others.  This also means the
redundant size check in ext4_ioctl_get_es_cache can go away, as all
size checking is done in the shared fiemap handler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200505154324.3226743-3-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: fix EXT4_MAX_LOGICAL_BLOCK macro
Ritesh Harjani [Tue, 5 May 2020 15:43:14 +0000 (17:43 +0200)]
ext4: fix EXT4_MAX_LOGICAL_BLOCK macro

ext4 supports max number of logical blocks in a file to be 0xffffffff.
(This is since ext4_extent's ee_block is __le32).
This means that EXT4_MAX_LOGICAL_BLOCK should be 0xfffffffe (starting
from 0 logical offset). This patch fixes this.

The issue was seen when ext4 moved to iomap_fiemap API and when
overlayfs was mounted on top of ext4. Since overlayfs was missing
filemap_check_ranges(), so it could pass a arbitrary huge length which
lead to overflow of map.m_len logic.

This patch fixes that.

Fixes: d3b6f23f7167 ("ext4: move ext4_fiemap to use iomap framework")
Reported-by: syzbot+77fa5bdb65cc39711820@syzkaller.appspotmail.com
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20200505154324.3226743-2-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoadd comment for ext4_dir_entry_2 file_type member
Jonathan Grant [Fri, 22 May 2020 15:07:58 +0000 (16:07 +0100)]
add comment for ext4_dir_entry_2 file_type member

Signed-off-by: Jonathan Grant <jg@jguk.org>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/ad3290d5-86af-99c1-f9d5-cd1bab710429@jguk.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agojbd2: avoid leaking transaction credits when unreserving handle
Jan Kara [Wed, 20 May 2020 13:31:19 +0000 (15:31 +0200)]
jbd2: avoid leaking transaction credits when unreserving handle

When reserved transaction handle is unused, we subtract its reserved
credits in __jbd2_journal_unreserve_handle() called from
jbd2_journal_stop(). However this function forgets to remove reserved
credits from transaction->t_outstanding_credits and thus the transaction
space that was reserved remains effectively leaked. The leaked
transaction space can be quite significant in some cases and leads to
unnecessarily small transactions and thus reducing throughput of the
journalling machinery. E.g. fsmark workload creating lots of 4k files
was observed to have about 20% lower throughput due to this when ext4 is
mounted with dioread_nolock mount option.

Subtract reserved credits from t_outstanding_credits as well.

CC: stable@vger.kernel.org
Fixes: 8f7d89f36829 ("jbd2: transaction reservation support")
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200520133119.1383-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: drop ext4_journal_free_reserved()
Jan Kara [Wed, 20 May 2020 13:31:18 +0000 (15:31 +0200)]
ext4: drop ext4_journal_free_reserved()

Remove ext4_journal_free_reserved() function. It is never used.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20200520133119.1383-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: mballoc: use lock for checking free blocks while retrying
Ritesh Harjani [Wed, 20 May 2020 06:40:36 +0000 (12:10 +0530)]
ext4: mballoc: use lock for checking free blocks while retrying

Currently while doing block allocation grp->bb_free may be getting
modified if discard is happening in parallel.
For e.g. consider a case where there are lot of threads who have
preallocated lot of blocks and there is a thread which is trying
to discard all of this group's PA. Now it could happen that
we see all of those group's bb_free is zero and fail the allocation
while there is sufficient space if we free up all the PA.

So this patch adds another flag "EXT4_MB_STRICT_CHECK" which will be set
if we are unable to allocate any blocks in the first try (since we may
not have considered blocks about to be discarded from PA lists).
So during retry attempt to allocate blocks we will use ext4_lock_group()
for checking if the group is good or not.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/9cb740a117c958c36596f167b12af1beae9a68b7.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: mballoc: refactor ext4_mb_good_group()
Ritesh Harjani [Wed, 20 May 2020 06:40:35 +0000 (12:10 +0530)]
ext4: mballoc: refactor ext4_mb_good_group()

ext4_mb_good_group() definition was changed some time back
and now it even initializes the buddy cache (via ext4_mb_init_group()),
if in case the EXT4_MB_GRP_NEED_INIT() is true for a group.
Note that ext4_mb_init_group() could sleep and so should not be called
under a spinlock held.
This is fine as of now because ext4_mb_good_group() is called before
loading the buddy bitmap without ext4_lock_group() held
and again called after loading the bitmap, only this time with
ext4_lock_group() held.
But still this whole thing is confusing.

So this patch refactors out ext4_mb_good_group_nolock() which should be
called when without holding ext4_lock_group().
Also in further patches we hold the spinlock (ext4_lock_group()) while
doing any calculations which involves grp->bb_free or grp->bb_fragments.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/d9f7d031a5fbe1c943fae6bf1ff5cdf0604ae722.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
Ritesh Harjani [Wed, 20 May 2020 06:40:34 +0000 (12:10 +0530)]
ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling

There could be a race in function ext4_mb_discard_group_preallocations()
where the 1st thread may iterate through group's bb_prealloc_list and
remove all the PAs and add to function's local list head.
Now if the 2nd thread comes in to discard the group preallocations,
it will see that the group->bb_prealloc_list is empty and will return 0.

Consider for a case where we have less number of groups
(for e.g. just group 0),
this may even return an -ENOSPC error from ext4_mb_new_blocks()
(where we call for ext4_mb_discard_group_preallocations()).
But that is wrong, since 2nd thread should have waited for 1st thread
to release all the PAs and should have retried for allocation.
Since 1st thread was anyway going to discard the PAs.

The algorithm using this percpu seq counter goes below:
1. We sample the percpu discard_pa_seq counter before trying for block
   allocation in ext4_mb_new_blocks().
2. We increment this percpu discard_pa_seq counter when we either allocate
   or free these blocks i.e. while marking those blocks as used/free in
   mb_mark_used()/mb_free_blocks().
3. We also increment this percpu seq counter when we successfully identify
   that the bb_prealloc_list is not empty and hence proceed for discarding
   of those PAs inside ext4_mb_discard_group_preallocations().

Now to make sure that the regular fast path of block allocation is not
affected, as a small optimization we only sample the percpu seq counter
on that cpu. Only when the block allocation fails and when freed blocks
found were 0, that is when we sample percpu seq counter for all cpus using
below function ext4_get_discard_pa_seq_sum(). This happens after making
sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty.

It can be well argued that why don't just check for grp->bb_free to
see if there are any free blocks to be allocated. So here are the two
concerns which were discussed:-

1. If for some reason the blocks available in the group are not
   appropriate for allocation logic (say for e.g.
   EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then
   the retry logic may result into infinte looping since grp->bb_free is
   non-zero.

2. Also before preallocation was clubbed with block allocation with the
   same ext4_lock_group() held, there were lot of races where grp->bb_free
   could not be reliably relied upon.
Due to above, this patch considers discard_pa_seq logic to determine if
we should retry for block allocation. Say if there are are n threads
trying for block allocation and none of those could allocate or discard
any of the blocks, then all of those n threads will fail the block
allocation and return -ENOSPC error. (Since the seq counter for all of
those will match as no block allocation/discard was done during that
duration).

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/7f254686903b87c419d798742fd9a1be34f0657b.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: mballoc: refactor ext4_mb_discard_preallocations()
Ritesh Harjani [Wed, 20 May 2020 06:40:33 +0000 (12:10 +0530)]
ext4: mballoc: refactor ext4_mb_discard_preallocations()

Implement ext4_mb_discard_preallocations_should_retry()
which we will need in later patches to add more logic
like check for sequence number match to see if we should
retry for block allocation or not.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/1cfae0098d2aa9afbeb59331401258182868c8f2.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: mballoc: add blocks to PA list under same spinlock after allocating blocks
Ritesh Harjani [Wed, 20 May 2020 06:40:32 +0000 (12:10 +0530)]
ext4: mballoc: add blocks to PA list under same spinlock after allocating blocks

ext4_mb_discard_preallocations() only checks for grp->bb_prealloc_list
of every group to discard the group's PA to free up the space if
allocation request fails. Consider below race:-

Process A   Process B

1. allocate blocks
1. Fails block allocation from
     ext4_mb_regular_allocator()
   ext4_lock_group()
allocated blocks
more than ac_o_ex.fe_len
   ext4_unlock_group()
2. Scans the
   grp->bb_prealloc_list (under
   ext4_lock_group()) and
   find nothing and thus return
   -ENOSPC.

2. Add the additional blocks to PA list

   ext4_lock_group()
    add blocks to grp->bb_prealloc_list
   ext4_unlock_group()

Above race could be avoided if we add those additional blocks to
grp->bb_prealloc_list at the same time with block allocation when
ext4_lock_group() was still held.
With this discard-PA will know if there are actually any blocks which
could be freed from the PA

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/a2217dd782585b42328981832e6d396abaaccb80.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: add casefold flag to EXT4_INODE_* flags
Eric Biggers [Sun, 10 May 2020 21:52:52 +0000 (14:52 -0700)]
ext4: add casefold flag to EXT4_INODE_* flags

No one currently needs EXT4_INODE_CASEFOLD, but add it to keep the
EXT4_INODE_* definitions in sync with the EXT4_*_FL definitions.

Also make it clearer that the casefold flag is only for directories.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20200510215252.87833-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: rework map struct instantiation in ext4_ext_map_blocks()
Eric Whitney [Sun, 10 May 2020 15:58:05 +0000 (11:58 -0400)]
ext4: rework map struct instantiation in ext4_ext_map_blocks()

The path performing block allocations in ext4_ext_map_blocks() contains
code trimming the length of a new extent that is repeated later
in the function.  This code is both redundant and unnecessary as the
exact length of the new extent has already been calculated.  Rewrite the
instantiation of the map struct in this case to use the available
values, avoiding the overhead of unnecessary conversions and improving
clarity.  Add another map struct instantiation tailored specifically to
the separate case for an existing written extent.  Remove an old comment
that no longer appears applicable to the current code.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200510155805.18808-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
4 years agoext4: make ext_debug() implementation to use pr_debug()
Ritesh Harjani [Sun, 10 May 2020 06:24:55 +0000 (11:54 +0530)]
ext4: make ext_debug() implementation to use pr_debug()

ext_debug() msgs could be helpful, provided those could be enabled
without recompiling kernel and also if we could selectively enable
only required prints for case by case debugging.

So make ext_debug() implementation use pr_debug().
Also change ext_debug() to be defined with CONFIG_EXT4_DEBUG.
So EXT_DEBUG macro now mostly remain for below 3 functions.
ext4_ext_show_path/leaf/move() (whose print msgs use ext_debug()
which again could be dynamically enabled using pr_debug())

This also changes the ext_debug() to take inode as a parameter
to add inode no. in all of it's msgs.
Prints additional info like process name / pid, superblock id etc.
This also removes any explicit function names passed in ext_debug().
Since ext_debug() on it's own prints file, func and line no.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/d31dc189b0aeda9384fe7665e36da7cd8c61571f.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: mballoc: make mb_debug() implementation to use pr_debug()
Ritesh Harjani [Sun, 10 May 2020 06:24:54 +0000 (11:54 +0530)]
ext4: mballoc: make mb_debug() implementation to use pr_debug()

mb_debug() msg had only 1 control level for all type of msgs.
And if we enable mballoc_debug then all of those msgs would be enabled.
Instead of adding multiple debug levels for mb_debug() msgs, use
pr_debug() with which we could have finer control to print msgs at all
of different levels (i.e. at file, func, line no.).

Also add process name/pid, superblk id, and other info in mb_debug()
msg. This also kills the mballoc_debug module parameter, since it is
not needed any more.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/f0c660cbde9e2edbe95c67942ca9ad80dd2231eb.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: replace EXT_DEBUG with __maybe_unused in ext4_ext_handle_unwritten_extents()
Ritesh Harjani [Sun, 10 May 2020 06:24:53 +0000 (11:54 +0530)]
ext4: replace EXT_DEBUG with __maybe_unused in ext4_ext_handle_unwritten_extents()

Replace EXT_DEBUG with __maybe_unused from inside
ext4_ext_handle_unwritten_extents() function.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/ae335b94506cd9db9d2648c1f4dd25a80f9f3ce2.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: improve ext_debug() msg in case of block allocation failure
Ritesh Harjani [Sun, 10 May 2020 06:24:52 +0000 (11:54 +0530)]
ext4: improve ext_debug() msg in case of block allocation failure

ext4_map_blocks() has ext_debug msg early at the start of function.
We also get ext_debug msg if we could allocate a block from
ext4_ext_map_blocks(). But there is no ext_debug() msg in case of
block allocation failure. So add one along with error code.

Also add more info in ext_debug() msg like how many blocks were allocated
v/s how many were requested in ext4_ext_map_blocks().

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/1610ec2aa932396be00f9d552fe29da473ead176.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: use BIT() macro for BH_** state bits
Ritesh Harjani [Sun, 10 May 2020 06:24:51 +0000 (11:54 +0530)]
ext4: use BIT() macro for BH_** state bits

Simply use BIT() macro for all BH_** state bits instead of open
coding it.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/57667689f51a3f9dba2fcef7d3425187fa3ba69f.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: balloc: use task_pid_nr() helper
Ritesh Harjani [Sun, 10 May 2020 06:24:50 +0000 (11:54 +0530)]
ext4: balloc: use task_pid_nr() helper

Use task_pid_nr() function instead of current->pid.
There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/4b58403e15e9c8deb34a1b93deb3fc9cd153ab84.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: mballoc: fix possible NULL ptr & remove BUG_ONs from DOUBLE_CHECK
Ritesh Harjani [Sun, 10 May 2020 06:24:49 +0000 (11:54 +0530)]
ext4: mballoc: fix possible NULL ptr & remove BUG_ONs from DOUBLE_CHECK

Make sure to check for e4b->bd_info->bb_bitmap == NULL, in
mb_cmp_bitmaps() and return if NULL, to avoid possible NULL ptr
dereference. Similar to how we do this in other ifdef DOUBLE_CHECK
functions.

Also remove the BUG_ON() logic if kmalloc() or ext4_read_block_bitmap()
fails. We should simply mark grp->bb_bitmap as NULL if above happens.
In fact ext4_read_block_bitmap() may even return an error in case of resize
ioctl. Hence remove this BUG_ON logic (fstests ext4/032 may trigger
this).

Link: https://lore.kernel.org/r/9a54f8a696ff17c057cd571be3d15ac3ec1407f1.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: mballoc: refactor code inside DOUBLE_CHECK into separate function
Ritesh Harjani [Sun, 10 May 2020 06:24:48 +0000 (11:54 +0530)]
ext4: mballoc: refactor code inside DOUBLE_CHECK into separate function

This patch implemets mb_group_bb_bitmap_alloc() and
mb_group_bb_bitmap_free() function to remove #ifdef DOUBLE_CHECK macro
and it's related code from inside
ext4_mb_add_groupinfo()/ext4_mb_release().

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/8c2095d74b779f0254a19b24982490dc6f07c4f9.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: mballoc: make ext4_mb_use_preallocated() return type as bool
Ritesh Harjani [Sun, 10 May 2020 06:24:47 +0000 (11:54 +0530)]
ext4: mballoc: make ext4_mb_use_preallocated() return type as bool

Change return type of function ext4_mb_use_preallocated() to bool to
better reflect what this function can return.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/7880cb6ef911465beafefcd7e9c3ea214688744b.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: mballoc: simplify error handling in ext4_init_mballoc()
Ritesh Harjani [Sun, 10 May 2020 06:24:46 +0000 (11:54 +0530)]
ext4: mballoc: simplify error handling in ext4_init_mballoc()

This patch simplifies error handling logic in ext4_init_mballoc(),
by adding all the cleanups at one place at the end of that function.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/8621a7bc68f7107a9ac4292afeb784515333bd25.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: mballoc: fix few other format specifier in mb_debug()
Ritesh Harjani [Sun, 10 May 2020 06:24:45 +0000 (11:54 +0530)]
ext4: mballoc: fix few other format specifier in mb_debug()

Fix few other format specifiers in mb_debug() msgs.
As such no other functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/574fa7f833abf2dbf3b53a2fea3195e71f6cdbd8.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: mballoc: correct the mb_debug() format specifier for pa_len var
Ritesh Harjani [Sun, 10 May 2020 06:24:44 +0000 (11:54 +0530)]
ext4: mballoc: correct the mb_debug() format specifier for pa_len var

pa->pa_len is an integer. Fix all of the format specifier used in
mb_debug() for pa_len to %d instead of %u.

As such no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/af4987f643c586f62bcc9961e43f0a67151d5551.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: mballoc: add more mb_debug() msgs
Ritesh Harjani [Sun, 10 May 2020 06:24:43 +0000 (11:54 +0530)]
ext4: mballoc: add more mb_debug() msgs

This patch adds some more debugging mb_debug() msgs to help improve
mballoc code debugging.
Other than adding more mb_debug() msgs at few more places,
there should be no other functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/5fc8e7788b924e211fcfa4a4c1d2f8503511661a.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: mballoc: refactor ext4_mb_show_ac()
Ritesh Harjani [Sun, 10 May 2020 06:24:42 +0000 (11:54 +0530)]
ext4: mballoc: refactor ext4_mb_show_ac()

This factors out ext4_mb_show_pa() function to show all the group's
preallocation info. This could be useful info to be added in later
patches.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/8f07d890b0038dcc935e9c10e6043ec9f3792721.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: mballoc: print bb_free info even when it is 0
Ritesh Harjani [Sun, 10 May 2020 06:24:41 +0000 (11:54 +0530)]
ext4: mballoc: print bb_free info even when it is 0

Improve the debugging msg by also printing even if bb_free is 0.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/c894f1d1d30f86ae38f4e3a861949665b6dc61cd.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: avoid ext4_error()'s caused by ENOMEM in the truncate path
Theodore Ts'o [Thu, 7 May 2020 17:50:28 +0000 (10:50 -0700)]
ext4: avoid ext4_error()'s caused by ENOMEM in the truncate path

We can't fail in the truncate path without requiring an fsck.
Add work around for this by using a combination of retry loops
and the __GFP_NOFAIL flag.

From: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Anna Pendleton <pendleton@google.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20200507175028.15061-1-pendleton@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: fix race between ext4_sync_parent() and rename()
Eric Biggers [Wed, 6 May 2020 18:31:40 +0000 (11:31 -0700)]
ext4: fix race between ext4_sync_parent() and rename()

'igrab(d_inode(dentry->d_parent))' without holding dentry->d_lock is
broken because without d_lock, d_parent can be concurrently changed due
to a rename().  Then if the old directory is immediately deleted, old
d_parent->inode can be NULL.  That causes a NULL dereference in igrab().

To fix this, use dget_parent() to safely grab a reference to the parent
dentry, which pins the inode.  This also eliminates the need to use
d_find_any_alias() other than for the initial inode, as we no longer
throw away the dentry at each step.

This is an extremely hard race to hit, but it is possible.  Adding a
udelay() in between the reads of ->d_parent and its ->d_inode makes it
reproducible on a no-journal filesystem using the following program:

    #include <fcntl.h>
    #include <unistd.h>

    int main()
    {
        if (fork()) {
            for (;;) {
                mkdir("dir1", 0700);
                int fd = open("dir1/file", O_RDWR|O_CREAT|O_SYNC);
                write(fd, "X", 1);
                close(fd);
            }
        } else {
            mkdir("dir2", 0700);
            for (;;) {
                rename("dir1/file", "dir2/file");
                rmdir("dir1");
            }
        }
    }

Fixes: d59729f4e794 ("ext4: fix races in ext4_sync_parent()")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20200506183140.541194-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: fix a typo in a comment
Christophe JAILLET [Sun, 3 May 2020 20:06:47 +0000 (22:06 +0200)]
ext4: fix a typo in a comment

s/extnets/extents/

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/20200503200647.154701-1-christophe.jaillet@wanadoo.fr
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: clean up ext4_ext_convert_to_initialized() error handling
Eric Whitney [Thu, 30 Apr 2020 18:53:20 +0000 (14:53 -0400)]
ext4: clean up ext4_ext_convert_to_initialized() error handling

If ext4_ext_convert_to_initialized() fails when called within
ext4_ext_handle_unwritten_extents(), immediately error out through the
exit point at function end.  Fix the error handling in the event
ext4_ext_convert_to_initialized() returns 0, which it shouldn't do when
converting an existing extent.  The current code returns the passed in
value of allocated (which is likely non-zero) while failing to set
m_flags, m_pblk, and m_len.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200430185320.23001-5-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: clean up GET_BLOCKS_PRE_IO error handling
Eric Whitney [Thu, 30 Apr 2020 18:53:19 +0000 (14:53 -0400)]
ext4: clean up GET_BLOCKS_PRE_IO error handling

If the call to ext4_split_convert_extents() fails in the
EXT4_GET_BLOCKS_PRE_IO case within ext4_ext_handle_unwritten_extents(),
error out through the exit point at function end rather than jumping
through an intermediate point.  Fix the error handling in the event
ext4_split_convert_extents() returns 0, which it shouldn't do when
splitting an existing extent.  The current code returns the passed in
value of allocated (which is likely non-zero) while failing to set
m_flags, m_pblk, and m_len.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200430185320.23001-4-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: remove redundant GET_BLOCKS_CONVERT code
Eric Whitney [Thu, 30 Apr 2020 18:53:18 +0000 (14:53 -0400)]
ext4: remove redundant GET_BLOCKS_CONVERT code

Remove the redundant code assigning values to ext4_map_blocks components
in ext4_ext_handle_unwritten_extents() for the EXT4_GET_BLOCKS_CONVERT
case, using the code at the function exit instead.  Clean up and reorder
that code to eliminate more redundancy and improve readability.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200430185320.23001-3-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: remove dead GET_BLOCKS_ZERO code
Eric Whitney [Thu, 30 Apr 2020 18:53:17 +0000 (14:53 -0400)]
ext4: remove dead GET_BLOCKS_ZERO code

There's no call to ext4_map_blocks() in the current ext4 code with a
flags argument that combines EXT4_GET_BLOCKS_CONVERT and
EXT4_GET_BLOCKS_ZERO.  Remove the code that corresponds to this case
from ext4_ext_handle_unwritten_extents().

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200430185320.23001-2-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: don't ignore return values from ext4_ext_dirty()
Harshad Shirwadkar [Mon, 27 Apr 2020 01:34:38 +0000 (18:34 -0700)]
ext4: don't ignore return values from ext4_ext_dirty()

Don't ignore return values from ext4_ext_dirty, since the errors
indicate valid failures below Ext4.  In all of the other instances of
ext4_ext_dirty calls, the error return value is handled in some
way. This patch makes those remaining couple of places to handle
ext4_ext_dirty errors as well. In case of ext4_split_extent_at(), the
ignorance of return value is intentional. The reason is that we are
already in error path and there isn't much we can do if ext4_ext_dirty
returns error. This patch adds a comment for that case explaining why
we ignore the return value.

In the longer run, we probably should
make sure that errors from other mark_dirty routines are handled as
well.

Ran gce-xfstests smoke tests and verified that there were no
regressions.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200427013438.219117-2-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: handle ext4_mark_inode_dirty errors
Harshad Shirwadkar [Mon, 27 Apr 2020 01:34:37 +0000 (18:34 -0700)]
ext4: handle ext4_mark_inode_dirty errors

ext4_mark_inode_dirty() can fail for real reasons. Ignoring its return
value may lead ext4 to ignore real failures that would result in
corruption / crashes. Harden ext4_mark_inode_dirty error paths to fail
as soon as possible and return errors to the caller whenever
appropriate.

One of the possible scnearios when this bug could affected is that
while creating a new inode, its directory entry gets added
successfully but while writing the inode itself mark_inode_dirty
returns error which is ignored. This would result in inconsistency
that the directory entry points to a non-existent inode.

Ran gce-xfstests smoke tests and verified that there were no
regressions.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20200427013438.219117-1-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: fix error pointer dereference
Jeffle Xu [Thu, 23 Apr 2020 07:46:44 +0000 (15:46 +0800)]
ext4: fix error pointer dereference

Don't pass error pointers to brelse().

commit 7159a986b420 ("ext4: fix some error pointer dereferences") has fixed
some cases, fix the remaining one case.

Once ext4_xattr_block_find()->ext4_sb_bread() failed, error pointer is
stored in @bs->bh, which will be passed to brelse() in the cleanup
routine of ext4_xattr_set_handle(). This will then cause a NULL panic
crash in __brelse().

BUG: unable to handle kernel NULL pointer dereference at 000000000000005b
RIP: 0010:__brelse+0x1b/0x50
Call Trace:
 ext4_xattr_set_handle+0x163/0x5d0
 ext4_xattr_set+0x95/0x110
 __vfs_setxattr+0x6b/0x80
 __vfs_setxattr_noperm+0x68/0x1b0
 vfs_setxattr+0xa0/0xb0
 setxattr+0x12c/0x1a0
 path_setxattr+0x8d/0xc0
 __x64_sys_setxattr+0x27/0x30
 do_syscall_64+0x60/0x250
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

In this case, @bs->bh stores '-EIO' actually.

Fixes: fb265c9cb49e ("ext4: add ext4_sb_bread() to disambiguate ENOMEM cases")
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: stable@kernel.org # 2.6.19
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1587628004-95123-1-git-send-email-jefflexu@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: Avoid freeing inodes on dirty list
Jan Kara [Tue, 21 Apr 2020 08:54:45 +0000 (10:54 +0200)]
ext4: Avoid freeing inodes on dirty list

When we are evicting inode with journalled data, we may race with
transaction commit in the following way:

CPU0 CPU1
jbd2_journal_commit_transaction() evict(inode)
  inode_io_list_del()
  inode_wait_for_writeback()
  process BJ_Forget list
    __jbd2_journal_insert_checkpoint()
    __jbd2_journal_refile_buffer()
      __jbd2_journal_unfile_buffer()
        if (test_clear_buffer_jbddirty(bh))
          mark_buffer_dirty(bh)
    __mark_inode_dirty(inode)
  ext4_evict_inode(inode)
    frees the inode

This results in use-after-free issues in the writeback code (or
the assertion added in the previous commit triggering).

Fix the problem by removing inode from writeback lists once all the page
cache is evicted and so inode cannot be added to writeback lists again.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200421085445.5731-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agowriteback: Export inode_io_list_del()
Jan Kara [Tue, 21 Apr 2020 08:54:44 +0000 (10:54 +0200)]
writeback: Export inode_io_list_del()

Ext4 needs to remove inode from writeback lists after it is out of
visibility of its journalling machinery (which can still dirty the
inode). Export inode_io_list_del() for it.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200421085445.5731-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: fix buffer_head refcnt leak when ext4_iget() fails
Xiyu Yang [Thu, 23 Apr 2020 05:09:27 +0000 (13:09 +0800)]
ext4: fix buffer_head refcnt leak when ext4_iget() fails

ext4_orphan_get() invokes ext4_read_inode_bitmap(), which returns a
reference of the specified buffer_head object to "bitmap_bh" with
increased refcnt.

When ext4_orphan_get() returns, local variable "bitmap_bh" becomes
invalid, so the refcount should be decreased to keep refcount balanced.

The reference counting issue happens in one exception handling path of
ext4_orphan_get(). When ext4_iget() fails, the function forgets to
decrease the refcnt increased by ext4_read_inode_bitmap(), causing a
refcnt leak.

Fix this issue by calling brelse() when ext4_iget() fails.

Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/1587618568-13418-1-git-send-email-xiyuyang19@fudan.edu.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: fix EXT_MAX_EXTENT/INDEX to check for zeroed eh_max
Harshad Shirwadkar [Tue, 21 Apr 2020 02:39:59 +0000 (19:39 -0700)]
ext4: fix EXT_MAX_EXTENT/INDEX to check for zeroed eh_max

If eh->eh_max is 0, EXT_MAX_EXTENT/INDEX would evaluate to unsigned
(-1) resulting in illegal memory accesses. Although there is no
consistent repro, we see that generic/019 sometimes crashes because of
this bug.

Ran gce-xfstests smoke and verified that there were no regressions.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20200421023959.20879-2-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
4 years agoext4: remove unnecessary comparisons to bool
Jason Yan [Mon, 20 Apr 2020 04:29:18 +0000 (12:29 +0800)]
ext4: remove unnecessary comparisons to bool

Fix the following coccicheck warning:

fs/ext4/extents_status.c:1057:5-28: WARNING: Comparison to bool
fs/ext4/inode.c:2314:18-24: WARNING: Comparison to bool

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200420042918.19459-1-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: translate a few more map flags to strings in tracepoints
Eric Whitney [Wed, 15 Apr 2020 20:31:40 +0000 (16:31 -0400)]
ext4: translate a few more map flags to strings in tracepoints

As new ext4_map_blocks() flags have been added, not all have gotten flag
bit to string translations to make tracepoint output more readable.
Fix that, and go one step further by adding a translation for the
EXT4_EX_NOCACHE flag as well.  The EXT4_EX_FORCE_CACHE flag can never
be set in a tracepoint in the current code, so there's no need to
bother with a translation for it right now.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200415203140.30349-3-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: remove EXT4_GET_BLOCKS_KEEP_SIZE flag
Eric Whitney [Wed, 15 Apr 2020 20:31:39 +0000 (16:31 -0400)]
ext4: remove EXT4_GET_BLOCKS_KEEP_SIZE flag

The eofblocks code was removed in the 5.7 release by "ext4: remove
EOFBLOCKS_FL and associated code" (4337ecd1fe99).  The ext4_map_blocks()
flag used to trigger it can now be removed as well.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200415203140.30349-2-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: fix a style issue in fs/ext4/acl.c
Carlos Guerrero Álvarez [Thu, 16 Apr 2020 14:14:56 +0000 (16:14 +0200)]
ext4: fix a style issue in fs/ext4/acl.c

Fixed an if statement where braces were not needed.

Link: https://lore.kernel.org/r/20200416141456.1089-1-carlosteniswarrior@gmail.com
Signed-off-by: Carlos Guerrero Álvarez <carlosteniswarrior@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
4 years agoDocumentation/dax: Update DAX enablement for ext4
Ira Weiny [Thu, 28 May 2020 15:00:03 +0000 (08:00 -0700)]
Documentation/dax: Update DAX enablement for ext4

Update the document to reflect ext4 and xfs now behave the same.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20200528150003.828793-10-ira.weiny@intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agofs/ext4: Introduce DAX inode flag
Ira Weiny [Thu, 28 May 2020 15:00:02 +0000 (08:00 -0700)]
fs/ext4: Introduce DAX inode flag

Add a flag ([EXT4|FS]_DAX_FL) to preserve FS_XFLAG_DAX in the ext4
inode.

Set the flag to be user visible and changeable.  Set the flag to be
inherited.  Allow applications to change the flag at any time except if
it conflicts with the set of mutually exclusive flags (Currently VERITY,
ENCRYPT, JOURNAL_DATA).

Furthermore, restrict setting any of the exclusive flags if DAX is set.

While conceptually possible, we do not allow setting EXT4_DAX_FL while
at the same time clearing exclusion flags (or vice versa) for 2 reasons:

1) The DAX flag does not take effect immediately which
   introduces quite a bit of complexity
2) There is no clear use case for being this flexible

Finally, on regular files, flag the inode to not be cached to facilitate
changing S_DAX on the next creation of the inode.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20200528150003.828793-9-ira.weiny@intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agofs/ext4: Remove jflag variable
Ira Weiny [Thu, 28 May 2020 15:00:01 +0000 (08:00 -0700)]
fs/ext4: Remove jflag variable

The jflag variable serves almost no purpose.  Remove it.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20200528150003.828793-8-ira.weiny@intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agofs/ext4: Make DAX mount option a tri-state
Ira Weiny [Thu, 28 May 2020 15:00:00 +0000 (08:00 -0700)]
fs/ext4: Make DAX mount option a tri-state

We add 'always', 'never', and 'inode' (default).  '-o dax' continues to
operate the same which is equivalent to 'always'.  This new
functionality is limited to ext4 only.

Specifically we introduce a 2nd DAX mount flag EXT4_MOUNT2_DAX_NEVER and set
it and EXT4_MOUNT_DAX_ALWAYS appropriately for the mode.

We also force EXT4_MOUNT2_DAX_NEVER if !CONFIG_FS_DAX.

Finally, EXT4_MOUNT2_DAX_INODE is used solely to detect if the user
specified that option for printing.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20200528150003.828793-7-ira.weiny@intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agofs/ext4: Only change S_DAX on inode load
Ira Weiny [Thu, 28 May 2020 14:59:59 +0000 (07:59 -0700)]
fs/ext4: Only change S_DAX on inode load

To prevent complications with in memory inodes we only set S_DAX on
inode load.  FS_XFLAG_DAX can be changed at any time and S_DAX will
change after inode eviction and reload.

Add init bool to ext4_set_inode_flags() to indicate if the inode is
being newly initialized.

Assert that S_DAX is not set on an inode which is just being loaded.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20200528150003.828793-6-ira.weiny@intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agofs/ext4: Update ext4_should_use_dax()
Ira Weiny [Thu, 28 May 2020 14:59:58 +0000 (07:59 -0700)]
fs/ext4: Update ext4_should_use_dax()

S_DAX should only be enabled when the underlying block device supports
dax.

Cache the underlying support for DAX in the super block and modify
ext4_should_use_dax() to check for device support prior to the over
riding mount option.

While we are at it change the function to ext4_should_enable_dax() as
this better reflects the ask as well as matches xfs.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20200528150003.828793-5-ira.weiny@intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agofs/ext4: Change EXT4_MOUNT_DAX to EXT4_MOUNT_DAX_ALWAYS
Ira Weiny [Thu, 28 May 2020 14:59:57 +0000 (07:59 -0700)]
fs/ext4: Change EXT4_MOUNT_DAX to EXT4_MOUNT_DAX_ALWAYS

In prep for the new tri-state mount option which then introduces
EXT4_MOUNT_DAX_NEVER.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20200528150003.828793-4-ira.weiny@intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agofs/ext4: Disallow verity if inode is DAX
Ira Weiny [Thu, 28 May 2020 14:59:56 +0000 (07:59 -0700)]
fs/ext4: Disallow verity if inode is DAX

Verity and DAX are incompatible.  Changing the DAX mode due to a verity
flag change is wrong without a corresponding address_space_operations
update.

Make the 2 options mutually exclusive by returning an error if DAX was
set first.

(Setting DAX is already disabled if Verity is set first.)

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20200528150003.828793-3-ira.weiny@intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agofs/ext4: Narrow scope of DAX check in setflags
Ira Weiny [Thu, 28 May 2020 14:59:55 +0000 (07:59 -0700)]
fs/ext4: Narrow scope of DAX check in setflags

When preventing DAX and journaling on an inode.  Use the effective DAX
check rather than the mount option.

This will be required to support per inode DAX flags.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20200528150003.828793-2-ira.weiny@intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoext4: remove redundant variable has_bigalloc in ext4_fill_super
Kaixu Xia [Wed, 15 Apr 2020 07:25:42 +0000 (15:25 +0800)]
ext4: remove redundant variable has_bigalloc in ext4_fill_super

We can use the ext4_has_feature_bigalloc() function directly to check
bigalloc feature and the variable has_bigalloc is reduncant, so remove
it.

Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1586935542-29588-1-git-send-email-kaixuxia@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agofs: Introduce DCACHE_DONTCACHE
Ira Weiny [Thu, 30 Apr 2020 14:41:37 +0000 (07:41 -0700)]
fs: Introduce DCACHE_DONTCACHE

DCACHE_DONTCACHE indicates a dentry should not be cached on final
dput().

Also add a helper function to mark DCACHE_DONTCACHE on all dentries
pointing to a specific inode when that inode is being set I_DONTCACHE.

This facilitates dropping dentry references to inodes sooner which
require eviction to swap S_DAX mode.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
4 years agofs: Lift XFS_IDONTCACHE to the VFS layer
Ira Weiny [Thu, 30 Apr 2020 14:41:37 +0000 (07:41 -0700)]
fs: Lift XFS_IDONTCACHE to the VFS layer

DAX effective mode (S_DAX) changes requires inode eviction.

XFS has an advisory flag (XFS_IDONTCACHE) to prevent caching of the
inode if no other additional references are taken.  We lift this flag to
the VFS layer and change the behavior slightly by allowing the flag to
remain even if multiple references are taken.

This will expedite the eviction of inodes to change S_DAX.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
4 years agoext4: remove unnecessary test_opt for DIOREAD_NOLOCK
Kaixu Xia [Mon, 13 Apr 2020 04:24:22 +0000 (12:24 +0800)]
ext4: remove unnecessary test_opt for DIOREAD_NOLOCK

The DIOREAD_NOLOCK flag has been cleared when doing the test_opt
that is meaningless, so remove the unnecessary test_opt for DIOREAD_NOLOCK.

Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Link: https://lore.kernel.org/r/1586751862-19437-1-git-send-email-kaixuxia@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
4 years agoDocumentation/dax: Update Usage section
Ira Weiny [Thu, 30 Apr 2020 14:41:34 +0000 (07:41 -0700)]
Documentation/dax: Update Usage section

Update the Usage section to reflect the new individual dax selection
functionality.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
4 years agofs/stat: Define DAX statx attribute
Ira Weiny [Thu, 30 Apr 2020 14:41:34 +0000 (07:41 -0700)]
fs/stat: Define DAX statx attribute

In order for users to determine if a file is currently operating in DAX
state (effective DAX).  Define a statx attribute value and set that
attribute if the effective DAX flag is set.

To go along with this we propose the following addition to the statx man
page:

STATX_ATTR_DAX

The file is in the DAX (cpu direct access) state.  DAX state
attempts to minimize software cache effects for both I/O and
memory mappings of this file.  It requires a file system which
has been configured to support DAX.

DAX generally assumes all accesses are via cpu load / store
instructions which can minimize overhead for small accesses, but
may adversely affect cpu utilization for large transfers.

File I/O is done directly to/from user-space buffers and memory
mapped I/O may be performed with direct memory mappings that
bypass kernel page cache.

While the DAX property tends to result in data being transferred
synchronously, it does not give the same guarantees of O_SYNC
where data and the necessary metadata are transferred together.

A DAX file may support being mapped with the MAP_SYNC flag,
which enables a program to use CPU cache flush instructions to
persist CPU store operations without an explicit fsync(2).  See
mmap(2) for more information.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
4 years agofs: Remove unneeded IS_DAX() check in io_is_direct()
Ira Weiny [Thu, 30 Apr 2020 14:41:33 +0000 (07:41 -0700)]
fs: Remove unneeded IS_DAX() check in io_is_direct()

Remove the check because DAX now has it's own read/write methods and
file systems which support DAX check IS_DAX() prior to IOCB_DIRECT on
their own.  Therefore, it does not matter if the file state is DAX when
the iocb flags are created.

Also remove io_is_direct() as it is just a simple flag check.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
4 years agoLinux 5.7-rc4
Linus Torvalds [Sun, 3 May 2020 21:56:04 +0000 (14:56 -0700)]
Linux 5.7-rc4

4 years agoMerge tag 'for-5.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave...
Linus Torvalds [Sun, 3 May 2020 18:30:08 +0000 (11:30 -0700)]
Merge tag 'for-5.7-rc3-tag' of git://git./linux/kernel/git/kdave/linux

Pull more btrfs fixes from David Sterba:
 "A few more stability fixes, minor build warning fixes and git url
  fixup:

   - fix partial loss of prealloc extent past i_size after fsync

   - fix potential deadlock due to wrong transaction handle passing via
     journal_info

   - fix gcc 4.8 struct intialization warning

   - update git URL in MAINTAINERS entry"

* tag 'for-5.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  MAINTAINERS: btrfs: fix git repo URL
  btrfs: fix gcc-4.8 build warning for struct initializer
  btrfs: transaction: Avoid deadlock due to bad initialization timing of fs_info::journal_info
  btrfs: fix partial loss of prealloc extent past i_size after fsync

4 years agoMerge tag 'iommu-fixes-v5.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 3 May 2020 18:04:57 +0000 (11:04 -0700)]
Merge tag 'iommu-fixes-v5.7-rc3' of git://git./linux/kernel/git/joro/iommu

Pull IOMMU fixes from Joerg Roedel:

 - Fix a memory leak when dev_iommu gets freed and a sub-pointer does
   not

 - Build dependency fixes for Mediatek, spapr_tce, and Intel IOMMU
   driver

 - Export iommu_group_get_for_dev() only for GPLed modules

 - Fix AMD IOMMU interrupt remapping when x2apic is enabled

 - Fix error path in the QCOM IOMMU driver probe function

* tag 'iommu-fixes-v5.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/qcom: Fix local_base status check
  iommu: Properly export iommu_group_get_for_dev()
  iommu/vt-d: Use right Kconfig option name
  iommu/amd: Fix legacy interrupt remapping for x2APIC-enabled system
  iommu: spapr_tce: Disable compile testing to fix build on book3s_32 config
  iommu/mediatek: Fix MTK_IOMMU dependencies
  iommu: Fix the memory leak in dev_iommu_free()

4 years agoMAINTAINERS: btrfs: fix git repo URL
Eric Biggers [Fri, 1 May 2020 23:44:17 +0000 (16:44 -0700)]
MAINTAINERS: btrfs: fix git repo URL

The git repo listed for btrfs hasn't been updated in over a year.
List the current one instead.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agoMerge tag 'pm-5.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Linus Torvalds [Sat, 2 May 2020 20:45:30 +0000 (13:45 -0700)]
Merge tag 'pm-5.7-rc4' of git://git./linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:

 - prevent the intel_pstate driver from printing excessive diagnostic
   messages in some cases (Chris Wilson)

 - make the hibernation restore kernel freeze kernel threads as well as
   user space tasks (Dexuan Cui)

 - fix the ACPI device PM disagnostic messages to include the correct
   power state name (Kai-Heng Feng).

* tag 'pm-5.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: ACPI: Output correct message on target power state
  PM: hibernate: Freeze kernel threads in software_resume()
  cpufreq: intel_pstate: Only mention the BIOS disabling turbo mode once

4 years agoMerge branches 'pm-cpufreq' and 'pm-sleep'
Rafael J. Wysocki [Sat, 2 May 2020 19:39:17 +0000 (21:39 +0200)]
Merge branches 'pm-cpufreq' and 'pm-sleep'

* pm-cpufreq:
  cpufreq: intel_pstate: Only mention the BIOS disabling turbo mode once

* pm-sleep:
  PM: hibernate: Freeze kernel threads in software_resume()

4 years agoMerge tag 'iomap-5.7-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Linus Torvalds [Sat, 2 May 2020 18:31:12 +0000 (11:31 -0700)]
Merge tag 'iomap-5.7-fixes-1' of git://git./fs/xfs/xfs-linux

Pull iomap fix from Darrick Wong:
 "Hoist the check for an unrepresentable FIBMAP return value into
  ioctl_fibmap.

  The internal kernel function can handle 64-bit values (and is needed
  to fix a regression on ext4 + jbd2). It is only the userspace ioctl
  that is so old that it cannot deal"

* tag 'iomap-5.7-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  fibmap: Warn and return an error in case of block > INT_MAX

4 years agoMerge tag 'nfs-for-5.7-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Linus Torvalds [Sat, 2 May 2020 18:24:01 +0000 (11:24 -0700)]
Merge tag 'nfs-for-5.7-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable fixes:
   - fix handling of backchannel binding in BIND_CONN_TO_SESSION

  Bugfixes:
   - Fix a credential use-after-free issue in pnfs_roc()
   - Fix potential posix_acl refcnt leak in nfs3_set_acl
   - defer slow parts of rpc_free_client() to a workqueue
   - Fix an Oopsable race in __nfs_list_for_each_server()
   - Fix trace point use-after-free race
   - Regression: the RDMA client no longer responds to server disconnect
     requests
   - Fix return values of xdr_stream_encode_item_{present, absent}
   - _pnfs_return_layout() must always wait for layoutreturn completion

  Cleanups:
   - Remove unreachable error conditions"

* tag 'nfs-for-5.7-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS: Fix a race in __nfs_list_for_each_server()
  NFSv4.1: fix handling of backchannel binding in BIND_CONN_TO_SESSION
  SUNRPC: defer slow parts of rpc_free_client() to a workqueue.
  NFSv4: Remove unreachable error condition due to rpc_run_task()
  SUNRPC: Remove unreachable error condition
  xprtrdma: Fix use of xdr_stream_encode_item_{present, absent}
  xprtrdma: Fix trace point use-after-free race
  xprtrdma: Restore wake-up-all to rpcrdma_cm_event_handler()
  nfs: Fix potential posix_acl refcnt leak in nfs3_set_acl
  NFS/pnfs: Fix a credential use-after-free issue in pnfs_roc()
  NFS/pnfs: Ensure that _pnfs_return_layout() waits for layoutreturn completion

4 years agoMerge tag 'dmaengine-fix-5.7-rc4' of git://git.infradead.org/users/vkoul/slave-dma
Linus Torvalds [Sat, 2 May 2020 18:16:14 +0000 (11:16 -0700)]
Merge tag 'dmaengine-fix-5.7-rc4' of git://git.infradead.org/users/vkoul/slave-dma

Pull dmaengine fixes from Vinod Koul:
 "Core:
   - Documentation typo fixes
   - fix the channel indexes
   - dmatest: fixes for process hang and iterations

  Drivers:
   - hisilicon: build error fix without PCI_MSI
   - ti-k3: deadlock fix
   - uniphier-xdmac: fix for reg region
   - pch: fix data race
   - tegra: fix clock state"

* tag 'dmaengine-fix-5.7-rc4' of git://git.infradead.org/users/vkoul/slave-dma:
  dmaengine: dmatest: Fix process hang when reading 'wait' parameter
  dmaengine: dmatest: Fix iteration non-stop logic
  dmaengine: tegra-apb: Ensure that clock is enabled during of DMA synchronization
  dmaengine: fix channel index enumeration
  dmaengine: mmp_tdma: Reset channel error on release
  dmaengine: mmp_tdma: Do not ignore slave config validation errors
  dmaengine: pch_dma.c: Avoid data race between probe and irq handler
  dt-bindings: dma: uniphier-xdmac: switch to single reg region
  include/linux/dmaengine: Typos fixes in API documentation
  dmaengine: xilinx_dma: Add missing check for empty list
  dmaengine: ti: k3-psil: fix deadlock on error path
  dmaengine: hisilicon: Fix build error without PCI_MSI

4 years agoMerge tag 'vfio-v5.7-rc4' of git://github.com/awilliam/linux-vfio
Linus Torvalds [Sat, 2 May 2020 00:19:15 +0000 (17:19 -0700)]
Merge tag 'vfio-v5.7-rc4' of git://github.com/awilliam/linux-vfio

Pull VFIO fixes from Alex Williamson:

 - copy_*_user validity check for new vfio_dma_rw interface (Yan Zhao)

 - Fix a potential math overflow (Yan Zhao)

 - Use follow_pfn() for calculating PFNMAPs (Sean Christopherson)

* tag 'vfio-v5.7-rc4' of git://github.com/awilliam/linux-vfio:
  vfio/type1: Fix VA->PA translation for PFNMAP VMAs in vaddr_get_pfn()
  vfio: avoid possible overflow in vfio_iommu_type1_pin_pages
  vfio: checking of validity of user vaddr in vfio_dma_rw

4 years agoMerge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Linus Torvalds [Sat, 2 May 2020 00:09:31 +0000 (17:09 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux

Pull arm64 fix from Catalin Marinas:
 "Add -fasynchronous-unwind-tables to the vDSO CFLAGS"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: vdso: Add -fasynchronous-unwind-tables to cflags

4 years agoMerge tag 'io_uring-5.7-2020-05-01' of git://git.kernel.dk/linux-block
Linus Torvalds [Sat, 2 May 2020 00:03:06 +0000 (17:03 -0700)]
Merge tag 'io_uring-5.7-2020-05-01' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

 - Fix for statx not grabbing the file table, making AT_EMPTY_PATH fail

 - Cover a few cases where async poll can handle retry, eliminating the
   need for an async thread

 - fallback request busy/free fix (Bijan)

 - syzbot reported SQPOLL thread exit fix for non-preempt (Xiaoguang)

 - Fix extra put of req for sync_file_range (Pavel)

 - Always punt splice async. We'll improve this for 5.8, but wanted to
   eliminate the inode mutex lock from the non-blocking path for 5.7
   (Pavel)

* tag 'io_uring-5.7-2020-05-01' of git://git.kernel.dk/linux-block:
  io_uring: punt splice async because of inode mutex
  io_uring: check non-sync defer_list carefully
  io_uring: fix extra put in sync_file_range()
  io_uring: use cond_resched() in io_ring_ctx_wait_and_kill()
  io_uring: use proper references for fallback_req locking
  io_uring: only force async punt if poll based retry can't handle it
  io_uring: enable poll retry for any file with ->read_iter / ->write_iter
  io_uring: statx must grab the file table for valid fd

4 years agoMerge tag 'block-5.7-2020-05-01' of git://git.kernel.dk/linux-block
Linus Torvalds [Fri, 1 May 2020 18:13:36 +0000 (11:13 -0700)]
Merge tag 'block-5.7-2020-05-01' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A few fixes for this release:

   - NVMe pull request from Christoph, with a single fix for a double
     free in the namespace error handling.

   - Kill the bd_openers check in blk_drop_partitions(), fixing a
     regression in this merge window (Christoph)"

* tag 'block-5.7-2020-05-01' of git://git.kernel.dk/linux-block:
  block: remove the bd_openers checks in blk_drop_partitions
  nvme: prevent double free in nvme_alloc_ns() error handling

4 years agoMerge branch 'i2c/for-current-fixed' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 1 May 2020 18:10:09 +0000 (11:10 -0700)]
Merge branch 'i2c/for-current-fixed' of git://git./linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:
 "Three driver bugfixes, and two reverts because the original patches
  revealed underlying problems which the Tegra guys are now working on"

* 'i2c/for-current-fixed' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: aspeed: Avoid i2c interrupt status clear race condition.
  i2c: amd-mp2-pci: Fix Oops in amd_mp2_pci_init() error handling
  Revert "i2c: tegra: Better handle case where CPU0 is busy for a long time"
  Revert "i2c: tegra: Synchronize DMA before termination"
  i2c: iproc: generate stop event for slave writes

4 years agoMerge tag 'sound-5.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai...
Linus Torvalds [Fri, 1 May 2020 18:05:28 +0000 (11:05 -0700)]
Merge tag 'sound-5.7-rc4' of git://git./linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
 "Just a collection of small fixes around this time:

   - One more try for fixing PCM OSS regression

   - HD-audio: a new quirk for Lenovo, the improved driver blacklisting,
     a lock fix in the minor error path, and a fix for the possible race
     at monitor notifiaction

   - USB-audio: a quirk ID fix, a fix for POD HD500 workaround"

* tag 'sound-5.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: usb-audio: Correct a typo of NuPrime DAC-10 USB ID
  ALSA: opti9xx: shut up gcc-10 range warning
  ALSA: hda/hdmi: fix without unlocked before return
  ALSA: hda/hdmi: fix race in monitor detection during probe
  ALSA: hda/realtek - Two front mics on a Lenovo ThinkCenter
  ALSA: line6: Fix POD HD500 audio playback
  ALSA: pcm: oss: Place the plugin buffer overflow checks correctly (for 5.7)
  ALSA: pcm: oss: Place the plugin buffer overflow checks correctly
  ALSA: hda: Match both PCI ID and SSID for driver blacklist

4 years agoMerge tag 'drm-fixes-2020-05-01' of git://anongit.freedesktop.org/drm/drm
Linus Torvalds [Fri, 1 May 2020 18:01:51 +0000 (11:01 -0700)]
Merge tag 'drm-fixes-2020-05-01' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Dave Airlie:
 "Regular scheduled fixes for graphics. Nothing to extreme bunch of
  amdgpu fixes, i915 and qxl fixes, along with some misc ones.

  All seems to be progressing normally.

  core:
   - EDID off by one DTD fix
   - DP mst write return code fix

  dma-buf:
   - fix SET_NAME ioctl uapi
   - doc fixes

  amdgpu:
   - Fix a green screen on resume issue
   - PM fixes for SR-IOV SDMA fix for navi
   - Renoir display fixes
   - Cursor and pageflip stuttering fixes
   - Misc additional display fixes
   - (uapi) Add additional DCC tiling flags for navi1x

  i915:
   - Fix selftest refcnt leak (Xiyu)
   - Fix gem vma lock (Chris)
   - Fix gt's i915_request.timeline acquire by checking if cacheline is
     valid (Chris)
   - Fix IRQ postinistall fault masks (Matt)

  qxl:
   - use after gree fix
   - fix lost kunmap
   - release leak fix

  virtio:
   - context destruction fix"

* tag 'drm-fixes-2020-05-01' of git://anongit.freedesktop.org/drm/drm: (26 commits)
  dma-buf: fix documentation build warnings
  drm/qxl: qxl_release use after free
  drm/qxl: lost qxl_bo_kunmap_atomic_page in qxl_image_init_helper()
  drm/i915: Use proper fault mask in interrupt postinstall too
  drm/amd/display: Use cursor locking to prevent flip delays
  drm/amd/display: Update downspread percent to match spreadsheet for DCN2.1
  drm/amd/display: Defer cursor update around VUPDATE for all ASIC
  drm/amd/display: fix rn soc bb update
  drm/amd/display: check if REFCLK_CNTL register is present
  drm/amdgpu: bump version for invalidate L2 before SDMA IBs
  drm/amdgpu: invalidate L2 before SDMA IBs (v2)
  drm/amdgpu: add tiling flags from Mesa
  drm/amd/powerplay: avoid using pm_en before it is initialized revised
  Revert "drm/amd/powerplay: avoid using pm_en before it is initialized"
  drm/qxl: qxl_release leak in qxl_hw_surface_alloc()
  drm/qxl: qxl_release leak in qxl_draw_dirty_fb()
  drm/virtio: only destroy created contexts
  drm/dp_mst: Fix drm_dp_send_dpcd_write() return code
  drm/i915/gt: Check cacheline is valid before acquiring
  drm/i915/gem: Hold obj->vma.lock over for_each_ggtt_vma()
  ...

4 years agoMerge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Linus Torvalds [Fri, 1 May 2020 18:00:07 +0000 (11:00 -0700)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "Four minor fixes: three in drivers and one in the core.

  The core one allows an additional state change that fixes a regression
  introduced by an update to the aacraid driver in the previous merge
  window"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: target/iblock: fix WRITE SAME zeroing
  scsi: qla2xxx: check UNLOADING before posting async work
  scsi: qla2xxx: set UNLOADING before waiting for session deletion
  scsi: core: Allow the state change from SDEV_QUIESCE to SDEV_BLOCK

4 years agoio_uring: punt splice async because of inode mutex
Pavel Begunkov [Fri, 1 May 2020 14:09:38 +0000 (17:09 +0300)]
io_uring: punt splice async because of inode mutex

Nonblocking do_splice() still may wait for some time on an inode mutex.
Let's play safe and always punt it async.

Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoio_uring: check non-sync defer_list carefully
Pavel Begunkov [Fri, 1 May 2020 14:09:37 +0000 (17:09 +0300)]
io_uring: check non-sync defer_list carefully

io_req_defer() do double-checked locking. Use proper helpers for that,
i.e. list_empty_careful().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoio_uring: fix extra put in sync_file_range()
Pavel Begunkov [Fri, 1 May 2020 14:09:36 +0000 (17:09 +0300)]
io_uring: fix extra put in sync_file_range()

[   40.179474] refcount_t: underflow; use-after-free.
[   40.179499] WARNING: CPU: 6 PID: 1848 at lib/refcount.c:28 refcount_warn_saturate+0xae/0xf0
...
[   40.179612] RIP: 0010:refcount_warn_saturate+0xae/0xf0
[   40.179617] Code: 28 44 0a 01 01 e8 d7 01 c2 ff 0f 0b 5d c3 80 3d 15 44 0a 01 00 75 91 48 c7 c7 b8 f5 75 be c6 05 05 44 0a 01 01 e8 b7 01 c2 ff <0f> 0b 5d c3 80 3d f3 43 0a 01 00 0f 85 6d ff ff ff 48 c7 c7 10 f6
[   40.179619] RSP: 0018:ffffb252423ebe18 EFLAGS: 00010286
[   40.179623] RAX: 0000000000000000 RBX: ffff98d65e929400 RCX: 0000000000000000
[   40.179625] RDX: 0000000000000001 RSI: 0000000000000086 RDI: 00000000ffffffff
[   40.179627] RBP: ffffb252423ebe18 R08: 0000000000000001 R09: 000000000000055d
[   40.179629] R10: 0000000000000c8c R11: 0000000000000001 R12: 0000000000000000
[   40.179631] R13: ffff98d68c434400 R14: ffff98d6a9cbaa20 R15: ffff98d6a609ccb8
[   40.179634] FS:  0000000000000000(0000) GS:ffff98d6af580000(0000) knlGS:0000000000000000
[   40.179636] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   40.179638] CR2: 00000000033e3194 CR3: 000000006480a003 CR4: 00000000003606e0
[   40.179641] Call Trace:
[   40.179652]  io_put_req+0x36/0x40
[   40.179657]  io_free_work+0x15/0x20
[   40.179661]  io_worker_handle_work+0x2f5/0x480
[   40.179667]  io_wqe_worker+0x2a9/0x360
[   40.179674]  ? _raw_spin_unlock_irqrestore+0x24/0x40
[   40.179681]  kthread+0x12c/0x170
[   40.179685]  ? io_worker_handle_work+0x480/0x480
[   40.179690]  ? kthread_park+0x90/0x90
[   40.179695]  ret_from_fork+0x35/0x40
[   40.179702] ---[ end trace 85027405f00110aa ]---

Opcode handler must never put submission ref, but that's what
io_sync_file_range_finish() do. use io_steal_work() there.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoiommu/qcom: Fix local_base status check
Tang Bin [Sat, 18 Apr 2020 13:47:03 +0000 (21:47 +0800)]
iommu/qcom: Fix local_base status check

The function qcom_iommu_device_probe() does not perform sufficient
error checking after executing devm_ioremap_resource(), which can
result in crashes if a critical error path is encountered.

Fixes: 0ae349a0f33f ("iommu/qcom: Add qcom_iommu")
Signed-off-by: Tang Bin <tangbin@cmss.chinamobile.com>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Link: https://lore.kernel.org/r/20200418134703.1760-1-tangbin@cmss.chinamobile.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
4 years agoiommu: Properly export iommu_group_get_for_dev()
Greg Kroah-Hartman [Thu, 30 Apr 2020 12:01:20 +0000 (14:01 +0200)]
iommu: Properly export iommu_group_get_for_dev()

In commit a7ba5c3d008d ("drivers/iommu: Export core IOMMU API symbols to
permit modular drivers") a bunch of iommu symbols were exported, all
with _GPL markings except iommu_group_get_for_dev().  That export should
also be _GPL like the others.

Fixes: a7ba5c3d008d ("drivers/iommu: Export core IOMMU API symbols to permit modular drivers")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: John Garry <john.garry@huawei.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20200430120120.2948448-1-gregkh@linuxfoundation.org
Signed-off-by: Joerg Roedel <jroedel@suse.de>
4 years agoiommu/vt-d: Use right Kconfig option name
Lu Baolu [Fri, 1 May 2020 07:24:27 +0000 (15:24 +0800)]
iommu/vt-d: Use right Kconfig option name

The CONFIG_ prefix should be added in the code.

Fixes: 046182525db61 ("iommu/vt-d: Add Kconfig option to enable/disable scalable mode")
Reported-and-tested-by: Kumar, Sanjay K <sanjay.k.kumar@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Link: https://lore.kernel.org/r/20200501072427.14265-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
4 years agoiommu/amd: Fix legacy interrupt remapping for x2APIC-enabled system
Suravee Suthikulpanit [Wed, 22 Apr 2020 13:30:02 +0000 (08:30 -0500)]
iommu/amd: Fix legacy interrupt remapping for x2APIC-enabled system

Currently, system fails to boot because the legacy interrupt remapping
mode does not enable 128-bit IRTE (GA), which is required for x2APIC
support.

Fix by using AMD_IOMMU_GUEST_IR_LEGACY_GA mode when booting with
kernel option amd_iommu_intr=legacy instead. The initialization
logic will check GASup and automatically fallback to using
AMD_IOMMU_GUEST_IR_LEGACY if GA mode is not supported.

Fixes: 3928aa3f5775 ("iommu/amd: Detect and enable guest vAPIC support")
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/1587562202-14183-1-git-send-email-suravee.suthikulpanit@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
4 years agoio_uring: use cond_resched() in io_ring_ctx_wait_and_kill()
Xiaoguang Wang [Fri, 1 May 2020 00:52:56 +0000 (08:52 +0800)]
io_uring: use cond_resched() in io_ring_ctx_wait_and_kill()

While working on to make io_uring sqpoll mode support syscalls that need
struct files_struct, I got cpu soft lockup in io_ring_ctx_wait_and_kill(),

    while (ctx->sqo_thread && !wq_has_sleeper(&ctx->sqo_wait))
        cpu_relax();

above loop never has an chance to exit, it's because preempt isn't enabled
in the kernel, and the context calling io_ring_ctx_wait_and_kill() and
io_sq_thread() run in the same cpu, if io_sq_thread calls a cond_resched()
yield cpu and another context enters above loop, then io_sq_thread() will
always in runqueue and never exit.

Use cond_resched() can fix this issue.

Reported-by: syzbot+66243bb7126c410cefe6@syzkaller.appspotmail.com
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>