Chaitanya Kulkarni [Fri, 23 Aug 2019 04:45:16 +0000 (21:45 -0700)]
null_blk: create a helper for badblocks
This patch creates a helper for handling badblocks code in the
null_handle_cmd().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chaitanya Kulkarni [Fri, 23 Aug 2019 04:45:15 +0000 (21:45 -0700)]
null_blk: create a helper for throttling
This patch creates a helper for handling throttling code in the
null_handle_cmd().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chaitanya Kulkarni [Fri, 23 Aug 2019 04:45:14 +0000 (21:45 -0700)]
null_blk: move duplicate code to callers
This is a preparation patch which moves the duplicate code for sectors
and nr_sectors calculations for bio vs request mode into their
respective callers (null_queue_bio(), null_qeueue_req()). Now the core
function only deals with the respective actions and commands instead of
having to calculte the bio vs req operations and different sector
related variables. We also move the flush command handling at the top
which significantly simplifies the rest of the code.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 12 Aug 2019 15:39:58 +0000 (17:39 +0200)]
block: move same page handling from __bio_add_pc_page to the callers
Hiding page refcount manipulation inside a low-level bio helper is
somewhat awkward. Instead return the same page information to the
callers, where it fits in much better.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 12 Aug 2019 15:39:57 +0000 (17:39 +0200)]
block: create a bio_try_merge_pc_page helper
Passsthrough bio handling should be the same as normal bio handling,
except that we need to take hardware limitations into account. Thus
use the common try_merge implementation after checking the hardware
limits. This changes behavior in that we now also check segment
and dma boundary settings for same page merges, which is a little
more work but has no effect as those need to be larger than the
page size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 12 Aug 2019 15:39:56 +0000 (17:39 +0200)]
block: improve the gap check in __bio_add_pc_page
If we can add more data into an existing segment we do not create a gap
per definition, so move the check for a gap after the attempt to merge
into the segment.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mike Christie [Sun, 4 Aug 2019 19:10:06 +0000 (14:10 -0500)]
nbd: fix max number of supported devs
This fixes a bug added in 4.10 with commit:
commit
9561a7ade0c205bc2ee035a2ac880478dcc1a024
Author: Josef Bacik <jbacik@fb.com>
Date: Tue Nov 22 14:04:40 2016 -0500
nbd: add multi-connection support
that limited the number of devices to 256. Before the patch we could
create 1000s of devices, but the patch switched us from using our
own thread to using a work queue which has a default limit of 256
active works.
The problem is that our recv_work function sits in a loop until
disconnection but only handles IO for one connection. The work is
started when the connection is started/restarted, but if we end up
creating 257 or more connections, the queue_work call just queues
connection257+'s recv_work and that waits for connection 1 - 256's
recv_work to be disconnected and that work instance completing.
Instead of reverting back to kthreads, this has us allocate a
workqueue_struct per device, so we can block in the work.
Cc: stable@vger.kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mike Christie [Tue, 13 Aug 2019 16:39:52 +0000 (11:39 -0500)]
nbd: fix zero cmd timeout handling v2
This fixes a regression added in 4.9 with commit:
commit
0eadf37afc2500e1162c9040ec26a705b9af8d47
Author: Josef Bacik <jbacik@fb.com>
Date: Thu Sep 8 12:33:40 2016 -0700
nbd: allow block mq to deal with timeouts
where before the patch userspace would set the timeout to 0 to disable
it. With the above patch, a zero timeout tells the block layer to use
the default value of 30 seconds. For setups where commands can take a
long time or experience transient issues like network disruptions this
then results in IO errors being sent to the application.
To fix this, the patch still uses the common block layer timeout
framework, but if zero is set, nbd just logs a message and then resets
the timer when it expires.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mike Christie [Tue, 13 Aug 2019 16:39:51 +0000 (11:39 -0500)]
nbd: add missing config put
Fix bug added with the patch:
commit
8f3ea35929a0806ad1397db99a89ffee0140822a
Author: Josef Bacik <josef@toxicpanda.com>
Date: Mon Jul 16 12:11:35 2018 -0400
nbd: handle unexpected replies better
where if the timeout handler runs when the completion path is and we fail
to grab the mutex in the timeout handler we will leave a config reference
and cannot free the config later.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mike Christie [Tue, 13 Aug 2019 16:39:50 +0000 (11:39 -0500)]
nbd: add function to convert blk req op to nbd cmd
This adds a helper function to convert a block req op to a nbd cmd type.
It will be used in the last patch to log the type in the timeout
handler.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mike Christie [Tue, 13 Aug 2019 16:39:49 +0000 (11:39 -0500)]
nbd: add set cmd timeout helper
Add a helper to set the cmd timeout. It does not really do a lot now,
but will be more useful in the next patches.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Revanth Rajashekar [Tue, 20 Aug 2019 15:30:51 +0000 (09:30 -0600)]
block: sed-opal: Removed duplicate OPAL_METHOD_LENGTH definition
The original commit adding the sed-opal library by mistake added two
definitions of OPAL_METHOD_LENGTH, remove one of them.
Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Revanth Rajashekar [Tue, 20 Aug 2019 15:30:50 +0000 (09:30 -0600)]
block: sed-opal: Remove always false conditional statement
In the function 'response_parse', num_entries will never be 0 as
slen is checked for 0. Hence, the condition 'if (num_entries == 0)'
can never be true.
Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Revanth Rajashekar [Tue, 20 Aug 2019 15:30:49 +0000 (09:30 -0600)]
block: sed-opal: Add/remove spaces
Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Junxiao Bi [Fri, 16 Aug 2019 21:12:33 +0000 (14:12 -0700)]
block: remove struct request_queue queue_head
The dispatch list is not used any more, as the legacy block IO stack
has been removed.
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Fri, 2 Aug 2019 19:08:13 +0000 (12:08 -0700)]
writeback, cgroup: inode_switch_wbs() shouldn't give up on wb_switch_rwsem trylock fail
As inode wb switching may make sync(2) miss some inodes, they're
synchronized using wb_switch_rwsem so that no wb switching happens
while sync(2) is in progress. In addition to synchronizing the actual
switching, the rwsem is also used to prevent queueing new switch
attempts while sync(2) is in progress. This is to avoid queueing too
many instances while the rwsem is held by sync(2). Unfortunately,
this is too agressive and can block wb switching for a long time if
sync(2) is frequent.
The goal is avoiding expolding the number of scheduled switches, not
avoiding scheduling anything. Let's use wb_switch_rwsem only for
synchronizing the actual switching and sync(2) and use
isw_nr_in_flight instead for limiting the maximum number of scheduled
switches. The limit is set to 1024 which should be more than enough
while still avoiding extreme situations.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Thu, 15 Aug 2019 19:25:28 +0000 (12:25 -0700)]
writeback, cgroup: Adjust WB_FRN_TIME_CUT_DIV to accelerate foreign inode switching
WB_FRN_TIME_CUT_DIV is used to tell the foreign inode detection logic
to ignore short writeback rounds to prevent getting confused by a
burst of short writebacks. The parameter is currently 2 meaning that
anything smaller than half of the running average writback duration
will be ignored.
This is unnecessarily aggressive. The detection logic uses 16 history
slots and is already reasonably protected against some short bursts
confusing it and the current parameter can lead to tens of seconds of
missed detection depending on the writeback pattern.
Let's change the parameter to 8, so that it only ignores writeback
with are smaller than 12.5% of the current running average.
v2: Add comment explaining what's going on with the foreign detection
parameters.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Johannes Weiner [Thu, 8 Aug 2019 19:03:00 +0000 (15:03 -0400)]
block: annotate refault stalls from IO submission
psi tracks the time tasks wait for refaulting pages to become
uptodate, but it does not track the time spent submitting the IO. The
submission part can be significant if backing storage is contended or
when cgroup throttling (io.latency) is in effect - a lot of time is
spent in submit_bio(). In that case, we underreport memory pressure.
Annotate submit_bio() to account submission time as memory stall when
the bio is reading userspace workingset pages.
Tested-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
zhengbin [Tue, 23 Jul 2019 14:10:42 +0000 (22:10 +0800)]
blk-mq: Fix memory leak in blk_mq_init_allocated_queue error handling
If blk_mq_init_allocated_queue->elevator_init_mq fails, need to release
the previously requested resources.
Fixes:
d34849913819 ("blk-mq-sched: allow setting of default IO scheduler")
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jann Horn [Tue, 26 Mar 2019 22:03:48 +0000 (23:03 +0100)]
floppy: fix usercopy direction
As sparse points out, these two copy_from_user() should actually be
copy_to_user().
Fixes:
229b53c9bf4e ("take floppy compat ioctls to sodding floppy.c")
Cc: stable@vger.kernel.org
Acked-by: Alexander Popov <alex.popov@linux.com>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 9 Aug 2019 04:11:28 +0000 (22:11 -0600)]
lightnvm: remove unused 'geo' variable
A previous commit correctly removed set-but-not-read variables, but
this left two new variables now unused. Kill them.
Fixes:
ba6f7da99aaf ("lightnvm: remove set but not used variables 'data_len' and 'rq_len'")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Alessio Balsini [Wed, 7 Aug 2019 00:48:28 +0000 (01:48 +0100)]
loop: Add LOOP_SET_DIRECT_IO to compat ioctl
Enabling Direct I/O with loop devices helps reducing memory usage by
avoiding double caching. 32 bit applications running on 64 bits systems
are currently not able to request direct I/O because is missing from the
lo_compat_ioctl.
This patch fixes the compatibility issue mentioned above by exporting
LOOP_SET_DIRECT_IO as additional lo_compat_ioctl() entry.
The input argument for this ioctl is a single long converted to a 1-bit
boolean, so compatibility is preserved.
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Alessio Balsini <balsini@android.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Zhou Wang [Wed, 24 Jul 2019 03:54:23 +0000 (11:54 +0800)]
lib: scatterlist: Fix to support no mapped sg
In function sg_split, the second sg_calculate_split will return -EINVAL
when in_mapped_nents is 0.
Indeed there is no need to do second sg_calculate_split and sg_split_mapped
when in_mapped_nents is 0, as in_mapped_nents indicates no mapped entry in
original sgl.
Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>
Acked-by: Robert Jarzmik <robert.jarzmik@free.fr>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
YueHaibing [Wed, 7 Aug 2019 13:18:47 +0000 (21:18 +0800)]
lightnvm: remove set but not used variables 'data_len' and 'rq_len'
drivers/lightnvm/pblk-read.c: In function pblk_submit_read_gc:
drivers/lightnvm/pblk-read.c:423:6: warning: variable data_len set but not used [-Wunused-but-set-variable]
drivers/lightnvm/pblk-recovery.c: In function pblk_recov_scan_oob:
drivers/lightnvm/pblk-recovery.c:368:15: warning: variable rq_len set but not used [-Wunused-but-set-variable]
They are not used since commit
48e5da725581 ("lightnvm:
move metadata mapping to lower level driver")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 7 Aug 2019 18:26:53 +0000 (12:26 -0600)]
Merge branch 'md-next' of https://github.com/liu-song-6/linux into for-5.4/block
Pull MD changes from Song.
* 'md-next' of https://github.com/liu-song-6/linux:
raid1: factor out a common routine to handle the completion of sync write
md: don't call spare_active in md_reap_sync_thread if all member devices can't work
md: don't set In_sync if array is frozen
md: allow last device to be forcibly removed from RAID1/RAID10.
md: Convert to use int_pow()
md/raid10: end bio when the device faulty
md/raid1: end bio when the device faulty
md/raid6: Set R5_ReadError when there is read failure on parity disk
raid1: use an int as the return value of raise_barrier()
Hou Tao [Sat, 27 Jul 2019 06:02:58 +0000 (14:02 +0800)]
raid1: factor out a common routine to handle the completion of sync write
It's just code clean-up.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Guoqing Jiang [Wed, 24 Jul 2019 09:09:21 +0000 (11:09 +0200)]
md: don't call spare_active in md_reap_sync_thread if all member devices can't work
When add one disk to array, the md_reap_sync_thread is responsible
to activate the spare and set In_sync flag for the new member in
spare_active().
But if raid1 has one member disk A, and disk B is added to the array.
Then we offline A before all the datas are synchronized from A to B,
obviously B doesn't have the latest data as A, but B is still marked
with In_sync flag.
So let's not call spare_active under the condition, otherwise B is
still showed with 'U' state which is not correct.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Guoqing Jiang [Wed, 24 Jul 2019 09:09:20 +0000 (11:09 +0200)]
md: don't set In_sync if array is frozen
When a disk is added to array, the following path is called in mdadm.
Manage_subdevs -> sysfs_freeze_array
-> Manage_add
-> sysfs_set_str(&info, NULL, "sync_action","idle")
Then from kernel side, Manage_add invokes the path (add_new_disk ->
validate_super = super_1_validate) to set In_sync flag.
Since In_sync means "device is in_sync with rest of array", and the new
added disk need to resync thread to help the synchronization of data.
And md_reap_sync_thread would call spare_active to set In_sync for the
new added disk finally. So don't set In_sync if array is in frozen.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Guoqing Jiang [Wed, 24 Jul 2019 09:09:19 +0000 (11:09 +0200)]
md: allow last device to be forcibly removed from RAID1/RAID10.
When the 'last' device in a RAID1 or RAID10 reports an error,
we do not mark it as failed. This would serve little purpose
as there is no risk of losing data beyond that which is obviously
lost (as there is with RAID5), and there could be other sectors
on the device which are readable, and only readable from this device.
This in general this maximises access to data.
However the current implementation also stops an admin from removing
the last device by direct action. This is rarely useful, but in many
case is not harmful and can make automation easier by removing special
cases.
Also, if an attempt to write metadata fails the device must be marked
as faulty, else an infinite loop will result, attempting to update
the metadata on all non-faulty devices.
So add 'fail_last_dev' member to 'struct mddev', then we can bypasses
the 'last disk' checks for RAID1 and RAID10, and control the behavior
per array by change sysfs node.
Signed-off-by: NeilBrown <neilb@suse.de>
[add sysfs node for fail_last_dev by Guoqing]
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Andy Shevchenko [Tue, 23 Jul 2019 20:41:55 +0000 (23:41 +0300)]
md: Convert to use int_pow()
Instead of linear approach to calculate power of 10, use generic int_pow()
which does it better.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Yufen Yu [Fri, 19 Jul 2019 05:48:47 +0000 (13:48 +0800)]
md/raid10: end bio when the device faulty
Just like raid1, we do not queue write error bio to retry write
and acknowlege badblocks, when the device is faulty.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Yufen Yu [Fri, 19 Jul 2019 05:48:46 +0000 (13:48 +0800)]
md/raid1: end bio when the device faulty
When write bio return error, it would be added to conf->retry_list
and wait for raid1d thread to retry write and acknowledge badblocks.
In narrow_write_error(), the error bio will be split in the unit of
badblock shift (such as one sector) and raid1d thread issues them
one by one. Until all of the splited bio has finished, raid1d thread
can go on processing other things, which is time consuming.
But, there is a scene for error handling that is not necessary.
When the device has been set faulty, flush_bio_list() may end
bios in pending_bio_list with error status. Since these bios
has not been issued to the device actually, error handlding to
retry write and acknowledge badblocks make no sense.
Even without that scene, when the device is faulty, badblocks info
can not be written out to the device. Thus, we also no need to
handle the error IO.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Xiao Ni [Mon, 8 Jul 2019 02:14:32 +0000 (10:14 +0800)]
md/raid6: Set R5_ReadError when there is read failure on parity disk
7471fb77ce4d ("md/raid6: Fix anomily when recovering a single device in
RAID6.") avoids rereading P when it can be computed from other members.
However, this misses the chance to re-write the right data to P. This
patch sets R5_ReadError if the re-read fails.
Also, when re-read is skipped, we also missed the chance to reset
rdev->read_errors to 0. It can fail the disk when there are many read
errors on P member disk (other disks don't have read error)
V2: upper layer read request don't read parity/Q data. So there is no
need to consider such situation.
This is Reported-by: kbuild test robot <lkp@intel.com>
Fixes:
7471fb77ce4d ("md/raid6: Fix anomily when recovering a single device in RAID6.")
Cc: <stable@vger.kernel.org> #4.4+
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Hou Tao [Tue, 2 Jul 2019 14:35:48 +0000 (22:35 +0800)]
raid1: use an int as the return value of raise_barrier()
Using a sector_t as the return value is misleading, because
raise_barrier() only return 0 or -EINTR.
Also add comments for the return values of raise_barrier().
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Hans Holmberg [Wed, 31 Jul 2019 09:41:36 +0000 (11:41 +0200)]
block: stop exporting bio_map_kern
Now that there no module users left of bio_map_kern, stop exporting the
symbol.
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hans Holmberg <hans@owltronix.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hans Holmberg [Wed, 31 Jul 2019 09:41:35 +0000 (11:41 +0200)]
lightnvm: pblk: use kvmalloc for metadata
There is no reason now not to use kvmalloc, so replace the internal
metadata allocation scheme.
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hans Holmberg <hans@owltronix.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hans Holmberg [Wed, 31 Jul 2019 09:41:34 +0000 (11:41 +0200)]
lightnvm: move metadata mapping to lower level driver
Now that blk_rq_map_kern can map both kmem and vmem, move internal
metadata mapping down to the lower level driver.
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hans Holmberg <hans@owltronix.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hans Holmberg [Wed, 31 Jul 2019 09:41:33 +0000 (11:41 +0200)]
lightnvm: remove nvm_submit_io_sync_fn
Move the redundant sync handling interface and wait for a completion in
the lightnvm core instead.
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hans Holmberg <hans@owltronix.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Thu, 25 Jul 2019 09:41:46 +0000 (17:41 +0800)]
blk-mq: balance mapping between present CPUs and queues
Spread queues among present CPUs first, then building mapping on other
non-present CPUs.
So we can minimize count of dead queues which are mapped by un-present
CPUs only. Then bad IO performance can be avoided by unbalanced mapping
between present CPUs and queues.
The similar policy has been applied on Managed IRQ affinity.
Cc: Yi Zhang <yi.zhang@redhat.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Thu, 25 Jul 2019 02:05:00 +0000 (10:05 +0800)]
scsi: implement .cleanup_rq callback
Implement .cleanup_rq() callback for freeing driver private part
of the request. Then we can avoid to leak this part if the request isn't
completed by SCSI, and freed by blk-mq or upper layer(such as dm-rq) finally.
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: <stable@vger.kernel.org>
Fixes:
396eaf21ee17 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Thu, 25 Jul 2019 02:04:59 +0000 (10:04 +0800)]
blk-mq: add callback of .cleanup_rq
SCSI maintains its own driver private data hooked off of each SCSI
request, and the pridate data won't be freed after scsi_queue_rq()
returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE. An upper layer driver
(e.g. dm-rq) may need to retry these SCSI requests, before SCSI has
fully dispatched them, due to a lower level SCSI driver's resource
limitation identified in scsi_queue_rq(). Currently SCSI's per-request
private data is leaked when the upper layer driver (dm-rq) frees and
then retries these requests in response to BLK_STS_RESOURCE or
BLK_STS_DEV_RESOURCE returns from scsi_queue_rq().
This usecase is so specialized that it doesn't warrant training an
existing blk-mq interface (e.g. blk_mq_free_request) to allow SCSI to
account for freeing its driver private data -- doing so would add an
extra branch for handling a special case that all other consumers of
SCSI (and blk-mq) won't ever need to worry about.
So the most pragmatic way forward is to delegate freeing SCSI driver
private data to the upper layer driver (dm-rq). Do so by adding
new .cleanup_rq callback and calling a new blk_mq_cleanup_rq() method
from dm-rq. A following commit will implement the .cleanup_rq() hook
in scsi_mq_ops.
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: <stable@vger.kernel.org>
Fixes:
396eaf21ee17 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chaitanya Kulkarni [Thu, 1 Aug 2019 17:26:38 +0000 (10:26 -0700)]
null_blk: implement REQ_OP_ZONE_RESET_ALL
This patch implements newly introduced zone reset all operation for
null_blk driver.
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chaitanya Kulkarni [Thu, 1 Aug 2019 17:26:37 +0000 (10:26 -0700)]
scsi: implement REQ_OP_ZONE_RESET_ALL
This patch implements the zone reset all operation for sd_zbc.c. We add
a new boolean parameter for the sd_zbc_setup_reset_cmd() to indicate
REQ_OP_ZONE_RESET_ALL command setup. Along with that we add support in
the completion path for the zone reset all.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chaitanya Kulkarni [Thu, 1 Aug 2019 17:26:36 +0000 (10:26 -0700)]
blk-zoned: implement REQ_OP_ZONE_RESET_ALL
This implements REQ_OP_ZONE_RESET_ALL as a special case of the block
device zone reset operations where we just simply issue bio with the
newly introduced req op.
We issue this req op when the number of sectors is equal to the device's
partition's number of sectors and device has no partitions.
We also add support so that blk_op_str() can print the new reset-all
zone operation.
This patch also adds a generic make request check for newly
introduced REQ_OP_ZONE_RESET_ALL req_opf. We simply return error
when queue is zoned and reset-all flag is not set for
REQ_OP_ZONE_RESET_ALL.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chaitanya Kulkarni [Thu, 1 Aug 2019 17:26:35 +0000 (10:26 -0700)]
block: add req op to reset all zones and flag
This patch introduces a new request operation REQ_OP_ZONE_RESET_ALL.
This is useful for the applications like mkfs where it needs to reset
all the zones present on the underlying block device. As part for this
patch we also introduce new QUEUE_FLAG_ZONE_RESETALL which indicates the
queue zone reset all capability and corresponding helper macro.
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 1 Aug 2019 22:39:55 +0000 (15:39 -0700)]
block: Fix a comment in blk_cleanup_queue()
Change a reference to the legacy block layer into a reference to blk-mq.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 1 Aug 2019 22:39:07 +0000 (15:39 -0700)]
block: Fix spelling in the header above blkg_lookup()
See also commit
8f4236d9008b ("block: remove QUEUE_FLAG_BYPASS and ->bypass") # v5.0.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 1 Aug 2019 22:50:44 +0000 (15:50 -0700)]
block: Improve physical block alignment of split bios
Consider the following example:
* The logical block size is 4 KB.
* The physical block size is 8 KB.
* max_sectors equals (16 KB >> 9) sectors.
* A non-aligned 4 KB and an aligned 64 KB bio are merged into a single
non-aligned 68 KB bio.
The current behavior is to split such a bio into (16 KB + 16 KB + 16 KB
+ 16 KB + 4 KB). The start of none of these five bio's is aligned to a
physical block boundary.
This patch ensures that such a bio is split into four aligned and
one non-aligned bio instead of being split into five non-aligned bios.
This improves performance because most block devices can handle aligned
requests faster than non-aligned requests.
Since the physical block size is larger than or equal to the logical
block size, this patch preserves the guarantee that the returned
value is a multiple of the logical block size.
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 1 Aug 2019 22:50:43 +0000 (15:50 -0700)]
block: Simplify blk_bio_segment_split()
Move the max_sectors check into bvec_split_segs() such that a single
call to that function can do all the necessary checks. This patch
optimizes the fast path further, namely if a bvec fits in a page.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 1 Aug 2019 22:50:42 +0000 (15:50 -0700)]
block: Simplify bvec_split_segs()
Simplify this function by by removing two if-tests. Other than requiring
that the @sectors pointer is not NULL, this patch does not change the
behavior of bvec_split_segs().
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 1 Aug 2019 22:50:41 +0000 (15:50 -0700)]
block: Document the bio splitting functions
Since what the bio splitting functions do is nontrivial, document these
functions.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 1 Aug 2019 22:50:40 +0000 (15:50 -0700)]
block: Declare several function pointer arguments 'const'
Make it clear to the compiler and also to humans that the functions
that query request queue properties do not modify any member of the
request_queue data structure.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 24 Jul 2019 03:48:43 +0000 (11:48 +0800)]
blk-mq: remove blk_mq_complete_request_sync
blk_mq_tagset_wait_completed_request() has been applied for waiting
for completed request's fn, so not necessary to use
blk_mq_complete_request_sync() any more.
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 24 Jul 2019 03:48:42 +0000 (11:48 +0800)]
nvme: wait until all completed request's complete fn is called
When aborting in-flight request for recovering controller, we have
to make sure that queue's complete function is called on completed
request before moving on. Otherwise, for example, the warning of
WARN_ON_ONCE(qp->mrs_used > 0) in ib_destroy_qp_user() may be
triggered on nvme-rdma.
Fix this issue by using blk_mq_tagset_wait_completed_request.
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 24 Jul 2019 03:48:41 +0000 (11:48 +0800)]
nvme: don't abort completed request in nvme_cancel_request
Before aborting in-flight requests, all IO queues and their interrupts
have been shutdown. However, request's completion function may not be
done yet because it can be scheduled to run via IPI.
So don't abort one request if it is marked as completed, otherwise
we may abort one normal completed request.
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 24 Jul 2019 03:48:40 +0000 (11:48 +0800)]
blk-mq: introduce blk_mq_tagset_wait_completed_request()
blk-mq may schedule to call queue's complete function on remote CPU via
IPI, but doesn't provide any way to synchronize the request's complete
fn. The current queue freeze interface can't provide the synchonization
because aborted requests stay at blk-mq queues during EH.
In some driver's EH(such as NVMe), hardware queue's resource may be freed &
re-allocated. If the completed request's complete fn is run finally after the
hardware queue's resource is released, kernel crash will be triggered.
Prepare for fixing this kind of issue by introducing
blk_mq_tagset_wait_completed_request().
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 24 Jul 2019 03:48:39 +0000 (11:48 +0800)]
blk-mq: introduce blk_mq_request_completed()
NVMe needs this function to decide if one request to be aborted has
been completed in normal IO path already.
So introduce it.
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Mon, 5 Aug 2019 01:40:12 +0000 (18:40 -0700)]
Linux 5.3-rc3
Linus Torvalds [Sun, 4 Aug 2019 23:39:07 +0000 (16:39 -0700)]
Merge tag 'tpmdd-next-
20190805' of git://git.infradead.org/users/jjs/linux-tpmdd
Pull tpm fixes from Jarkko Sakkinen:
"Two bug fixes that did not make into my first pull request"
* tag 'tpmdd-next-
20190805' of git://git.infradead.org/users/jjs/linux-tpmdd:
tpm: tpm_ibm_vtpm: Fix unallocated banks
tpm: Fix null pointer dereference on chip register error path
Linus Torvalds [Sun, 4 Aug 2019 23:37:08 +0000 (16:37 -0700)]
Merge tag 'mtd/fixes-for-5.3-rc3' of git://git./linux/kernel/git/mtd/linux
Pull MTD fixes from Miquel Raynal:
"NAND:
- Fix Micron driver as some chips enable internal ECC correction
during their discovery while they advertize they do not have any.
Hyperbus:
- Restrict the build to only ARM64 SoCs (and compile testing) which
is what should have been done since the beginning.
- Fix Kconfig issue by selection something instead of implying it"
* tag 'mtd/fixes-for-5.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
mtd: hyperbus: Add hardware dependency to AM654 driver
mtd: hyperbus: Kconfig: Fix HBMC_AM654 dependencies
mtd: rawnand: micron: handle on-die "ECC-off" devices correctly
Nayna Jain [Thu, 11 Jul 2019 16:13:35 +0000 (12:13 -0400)]
tpm: tpm_ibm_vtpm: Fix unallocated banks
The nr_allocated_banks and allocated banks are initialized as part of
tpm_chip_register. Currently, this is done as part of auto startup
function. However, some drivers, like the ibm vtpm driver, do not run
auto startup during initialization. This results in uninitialized memory
issue and causes a kernel panic during boot.
This patch moves the pcr allocation outside the auto startup function
into tpm_chip_register. This ensures that allocated banks are initialized
in any case.
Fixes:
879b589210a9 ("tpm: retrieve digest size of unknown algorithms with PCR read")
Reported-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Nayna Jain <nayna@linux.ibm.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Tested-by: Michal Suchánek <msuchanek@suse.de>
Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Milan Broz [Thu, 4 Jul 2019 07:26:15 +0000 (09:26 +0200)]
tpm: Fix null pointer dereference on chip register error path
If clk_enable is not defined and chip initialization
is canceled code hits null dereference.
Easily reproducible with vTPM init fail:
swtpm chardev --tpmstate dir=nonexistent_dir --tpm2 --vtpm-proxy
BUG: kernel NULL pointer dereference, address:
00000000
...
Call Trace:
tpm_chip_start+0x9d/0xa0 [tpm]
tpm_chip_register+0x10/0x1a0 [tpm]
vtpm_proxy_work+0x11/0x30 [tpm_vtpm_proxy]
process_one_work+0x214/0x5a0
worker_thread+0x134/0x3e0
? process_one_work+0x5a0/0x5a0
kthread+0xd4/0x100
? process_one_work+0x5a0/0x5a0
? kthread_park+0x90/0x90
ret_from_fork+0x19/0x24
Fixes:
719b7d81f204 ("tpm: introduce tpm_chip_start() and tpm_chip_stop()")
Cc: stable@vger.kernel.org # v5.1+
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Linus Torvalds [Sun, 4 Aug 2019 17:30:47 +0000 (10:30 -0700)]
Merge tag 'powerpc-5.3-3' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Some more powerpc fixes for 5.3:
- Wire up the new clone3 syscall.
- A fix for the PAPR SCM nvdimm driver, to fix a crash when firmware
gives us a device that's attached to a non-online NUMA node.
- A fix for a boot failure on 32-bit with KASAN enabled.
- Three fixes for implicit fall through warnings, some of which are
errors for us due to -Werror.
Thanks to: Aneesh Kumar K.V, Christophe Leroy, Kees Cook, Santosh
Sivaraj, Stephen Rothwell"
* tag 'powerpc-5.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/kasan: fix early boot failure on PPC32
drivers/macintosh/smu.c: Mark expected switch fall-through
powerpc/spe: Mark expected switch fall-throughs
powerpc/nvdimm: Pick nearby online node if the device node is not online
powerpc/kvm: Fall through switch case explicitly
powerpc: Wire up clone3 syscall
Geert Uytterhoeven [Mon, 29 Jul 2019 17:56:58 +0000 (19:56 +0200)]
MAINTAINERS: Add Geert as Renesas SoC Co-Maintainer
At the end of the v5.3 upstream kernel development cycle, Simon will be
stepping down from his role as Renesas SoC maintainer. Starting with
the v5.4 development cycle, Geert is taking over this role.
Add Geert as a co-maintainer, and add his git repository and branch.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 4 Aug 2019 17:16:30 +0000 (10:16 -0700)]
Merge tag 'kbuild-fixes-v5.3-2' of git://git./linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- detect missing missing "WITH Linux-syscall-note" for uapi headers
- fix needless rebuild when using Clang
- fix false-positive cc-option in Kconfig when using Clang
- avoid including corrupted .*.cmd files in the modpost stage
- fix warning of 'make vmlinux'
- fix {m,n,x,g}config to not generate the broken .config on the second
save operation.
- some trivial Makefile fixes
* tag 'kbuild-fixes-v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kconfig: Clear "written" flag to avoid data loss
kbuild: Check for unknown options with cc-option usage in Kconfig and clang
lib/raid6: fix unnecessary rebuild of vpermxor*.c
kbuild: modpost: do not parse unnecessary rules for vmlinux modpost
kbuild: modpost: remove unnecessary dependency for __modpost
kbuild: modpost: handle KBUILD_EXTRA_SYMBOLS only for external modules
kbuild: modpost: include .*.cmd files only when targets exist
kbuild: initialize CLANG_FLAGS correctly in the top Makefile
kbuild: detect missing "WITH Linux-syscall-note" for uapi headers
Linus Torvalds [Sun, 4 Aug 2019 17:02:13 +0000 (10:02 -0700)]
Merge tag 'safesetid-maintainers-correction-5.3-rc2' of git://github.com/micah-morton/linux
Pull SafeSetID maintainer update from Micah Morton:
"Add entry in MAINTAINERS file for SafeSetID LSM"
* tag 'safesetid-maintainers-correction-5.3-rc2' of git://github.com/micah-morton/linux:
Add entry in MAINTAINERS file for SafeSetID LSM
M. Vefa Bicakci [Sat, 3 Aug 2019 10:02:12 +0000 (06:02 -0400)]
kconfig: Clear "written" flag to avoid data loss
Prior to this commit, starting nconfig, xconfig or gconfig, and saving
the .config file more than once caused data loss, where a .config file
that contained only comments would be written to disk starting from the
second save operation.
This bug manifests itself because the SYMBOL_WRITTEN flag is never
cleared after the first call to conf_write, and subsequent calls to
conf_write then skip all of the configuration symbols due to the
SYMBOL_WRITTEN flag being set.
This commit resolves this issue by clearing the SYMBOL_WRITTEN flag
from all symbols before conf_write returns.
Fixes:
8e2442a5f86e ("kconfig: fix missing choice values in auto.conf")
Cc: linux-stable <stable@vger.kernel.org> # 4.19+
Signed-off-by: M. Vefa Bicakci <m.v.b@runbox.com>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Linus Torvalds [Sun, 4 Aug 2019 01:50:52 +0000 (18:50 -0700)]
Merge tag 'xtensa-
20190803' of git://github.com/jcmvbkbc/linux-xtensa
Pull Xtensa fix from Max Filippov:
"Fix build for xtensa cores with coprocessors that was broken by
entry/return abstraction patch"
* tag 'xtensa-
20190803' of git://github.com/jcmvbkbc/linux-xtensa:
xtensa: fix build for cores with coprocessors
Linus Torvalds [Sat, 3 Aug 2019 19:56:34 +0000 (12:56 -0700)]
Merge branch 'i2c/for-current-fixed' of git://git./linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"A set of driver fixes for the I2C subsystem"
* 'i2c/for-current-fixed' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: s3c2410: Mark expected switch fall-through
i2c: at91: fix clk_offset for sama5d2
i2c: at91: disable TXRDY interrupt after sending data
i2c: iproc: Fix i2c master read more than 63 bytes
eeprom: at24: make spd world-readable again
Linus Torvalds [Sat, 3 Aug 2019 17:58:46 +0000 (10:58 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf tooling fixes from Thomas Gleixner:
"A set of updates for perf tools and documentation:
perf header:
- Prevent a division by zero
- Deal with an uninitialized warning proper
libbpf:
- Fix the missiong __WORDSIZE definition for musl & al
UAPI headers:
- Synchronize kernel headers
Documentation:
- Fix the memory units for perf.data size"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
libbpf: fix missing __WORDSIZE definition
perf tools: Fix perf.data documentation units for memory size
perf header: Fix use of unitialized value warning
perf header: Fix divide by zero error if f_header.attr_size==0
tools headers UAPI: Sync if_link.h with the kernel
tools headers UAPI: Sync sched.h with the kernel
tools headers UAPI: Sync usbdevice_fs.h with the kernels to get new ioctl
tools perf beauty: Fix usbdevfs_ioctl table generator to handle _IOC()
tools headers UAPI: Update tools's copy of drm.h headers
tools headers UAPI: Update tools's copy of mman.h headers
tools headers UAPI: Update tools's copy of kvm.h headers
tools include UAPI: Sync x86's syscalls_64.tbl and generic unistd.h to pick up clone3 and pidfd_open
Linus Torvalds [Sat, 3 Aug 2019 17:51:29 +0000 (10:51 -0700)]
Merge branch 'timers-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull vdso timer fixes from Thomas Gleixner:
"A series of commits to deal with the regression caused by the generic
VDSO implementation.
The usage of clock_gettime64() for 32bit compat fallback syscalls
caused seccomp filters to kill innocent processes because they only
allow clock_gettime().
Handle the compat syscalls with clock_gettime() as before, which is
not a functional problem for the VDSO as the legacy compat application
interface is not y2038 safe anyway. It's just extra fallback code
which needs to be implemented on every architecture.
It's opt in for now so that it does not break the compile of already
converted architectures in linux-next. Once these are fixed, the
#ifdeffery goes away.
So much for trying to be smart and reuse code..."
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
arm64: compat: vdso: Use legacy syscalls as fallback
x86/vdso/32: Use 32bit syscall fallback
lib/vdso/32: Provide legacy syscall fallbacks
lib/vdso: Move fallback invocation to the callers
lib/vdso/32: Remove inconsistent NULL pointer checks
Linus Torvalds [Sat, 3 Aug 2019 17:49:45 +0000 (10:49 -0700)]
Merge branch 'irq-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A small bunch of fixes from the irqchip department:
- Fix a couple of UAF on error paths (RZA1, GICv3 ITS)
- Fix iMX GPCv2 trigger setting
- Add missing of_node_put() on error path in MBIGEN
- Add another bunch of /* fall-through */ to silence warnings"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/renesas-rza1: Fix an use-after-free in rza1_irqc_probe()
irqchip/irq-imx-gpcv2: Forward irq type to parent
irqchip/irq-mbigen: Add of_node_put() before return
irqchip/gic-v3-its: Free unused vpt_page when alloc vpe table fail
irqchip/gic-v3: Mark expected switch fall-through
Linus Torvalds [Sat, 3 Aug 2019 17:43:44 +0000 (10:43 -0700)]
Merge tag 'xfs-5.3-fixes-1' of git://git./fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
- Avoid leaking kernel stack contents to userspace
- Fix a potential null pointer dereference in the dabtree scrub code
* tag 'xfs-5.3-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: Fix possible null-pointer dereferences in xchk_da_btree_block_check_sibling()
xfs: fix stack contents leakage in the v1 inumber ioctls
Linus Torvalds [Sat, 3 Aug 2019 16:20:49 +0000 (09:20 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"17 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
drivers/acpi/scan.c: document why we don't need the device_hotplug_lock
memremap: move from kernel/ to mm/
lib/test_meminit.c: use GFP_ATOMIC in RCU critical section
asm-generic: fix -Wtype-limits compiler warnings
cgroup: kselftest: relax fs_spec checks
mm/memory_hotplug.c: remove unneeded return for void function
mm/migrate.c: initialize pud_entry in migrate_vma()
coredump: split pipe command whitespace before expanding template
page flags: prioritize kasan bits over last-cpuid
ubsan: build ubsan.c more conservatively
kasan: remove clang version check for KASAN_STACK
mm: compaction: avoid 100% CPU usage during compaction when a task is killed
mm: migrate: fix reference check race between __find_get_block() and migration
mm: vmscan: check if mem cgroup is disabled or not before calling memcg slab shrinker
ocfs2: remove set but not used variable 'last_hash'
Revert "kmemleak: allow to coexist with fault injection"
kernel/signal.c: fix a kernel-doc markup
Linus Torvalds [Sat, 3 Aug 2019 15:59:11 +0000 (08:59 -0700)]
Merge tag 'riscv/for-v5.3-rc3' of git://git./linux/kernel/git/riscv/linux
Pull RISC-V fixes from Paul Walmsley:
"Three minor RISC-V-related changes for v5.3-rc3:
- Add build ID to VDSO builds to avoid a double-free in perf when
libelf isn't used
- Align the RV64 defconfig to the output of "make savedefconfig" so
subsequent defconfig patches don't get out of hand
- Drop a superfluous DT property from the FU540 SoC DT data (since it
must be already set in board data that includes it)"
* tag 'riscv/for-v5.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: defconfig: align RV64 defconfig to the output of "make savedefconfig"
riscv: dts: fu540-c000: drop "timebase-frequency"
riscv: Fix perf record without libelf support
David Hildenbrand [Sat, 3 Aug 2019 04:49:29 +0000 (21:49 -0700)]
drivers/acpi/scan.c: document why we don't need the device_hotplug_lock
Let's document why the lock is not needed in acpi_scan_init(), right now
this is not really obvious.
[akpm@linux-foundation.org: fix tpyo]
Link: http://lkml.kernel.org/r/20190731135306.31524-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Christoph Hellwig [Sat, 3 Aug 2019 04:49:26 +0000 (21:49 -0700)]
memremap: move from kernel/ to mm/
memremap.c implements MM functionality for ZONE_DEVICE, so it really
should be in the mm/ directory, not the kernel/ one.
Link: http://lkml.kernel.org/r/20190722094143.18387-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexander Potapenko [Sat, 3 Aug 2019 04:49:22 +0000 (21:49 -0700)]
lib/test_meminit.c: use GFP_ATOMIC in RCU critical section
kmalloc() shouldn't sleep while in RCU critical section, therefore use
GFP_ATOMIC instead of GFP_KERNEL.
The bug was spotted by the 0day kernel testing robot.
Link: http://lkml.kernel.org/r/20190725121703.210874-1-glider@google.com
Fixes:
7e659650cbda ("lib: introduce test_meminit module")
Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qian Cai [Sat, 3 Aug 2019 04:49:19 +0000 (21:49 -0700)]
asm-generic: fix -Wtype-limits compiler warnings
Commit
d66acc39c7ce ("bitops: Optimise get_order()") introduced a
compilation warning because "rx_frag_size" is an "ushort" while
PAGE_SHIFT here is 16.
The commit changed the get_order() to be a multi-line macro where
compilers insist to check all statements in the macro even when
__builtin_constant_p(rx_frag_size) will return false as "rx_frag_size"
is a module parameter.
In file included from ./arch/powerpc/include/asm/page_64.h:107,
from ./arch/powerpc/include/asm/page.h:242,
from ./arch/powerpc/include/asm/mmu.h:132,
from ./arch/powerpc/include/asm/lppaca.h:47,
from ./arch/powerpc/include/asm/paca.h:17,
from ./arch/powerpc/include/asm/current.h:13,
from ./include/linux/thread_info.h:21,
from ./arch/powerpc/include/asm/processor.h:39,
from ./include/linux/prefetch.h:15,
from drivers/net/ethernet/emulex/benet/be_main.c:14:
drivers/net/ethernet/emulex/benet/be_main.c: In function 'be_rx_cqs_create':
./include/asm-generic/getorder.h:54:9: warning: comparison is always
true due to limited range of data type [-Wtype-limits]
(((n) < (1UL << PAGE_SHIFT)) ? 0 : \
^
drivers/net/ethernet/emulex/benet/be_main.c:3138:33: note: in expansion
of macro 'get_order'
adapter->big_page_size = (1 << get_order(rx_frag_size)) * PAGE_SIZE;
^~~~~~~~~
Fix it by moving all of this multi-line macro into a proper function,
and killing __get_order() off.
[akpm@linux-foundation.org: remove __get_order() altogether]
[cai@lca.pw: v2]
Link: http://lkml.kernel.org/r/1564000166-31428-1-git-send-email-cai@lca.pw
Link: http://lkml.kernel.org/r/1563914986-26502-1-git-send-email-cai@lca.pw
Fixes:
d66acc39c7ce ("bitops: Optimise get_order()")
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Bill Wendling <morbo@google.com>
Cc: James Y Knight <jyknight@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chris Down [Sat, 3 Aug 2019 04:49:15 +0000 (21:49 -0700)]
cgroup: kselftest: relax fs_spec checks
On my laptop most memcg kselftests were being skipped because it claimed
cgroup v2 hierarchy wasn't mounted, but this isn't correct. Instead, it
seems current systemd HEAD mounts it with the name "cgroup2" instead of
"cgroup":
% grep cgroup /proc/mounts
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0
I can't think of a reason to need to check fs_spec explicitly
since it's arbitrary, so we can just rely on fs_vfstype.
After these changes, `make TARGETS=cgroup kselftest` actually runs the
cgroup v2 tests in more cases.
Link: http://lkml.kernel.org/r/20190723210737.GA487@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Weitao Hou [Sat, 3 Aug 2019 04:49:12 +0000 (21:49 -0700)]
mm/memory_hotplug.c: remove unneeded return for void function
return is unneeded in void function
Link: http://lkml.kernel.org/r/20190723130814.21826-1-houweitaoo@gmail.com
Signed-off-by: Weitao Hou <houweitaoo@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ralph Campbell [Sat, 3 Aug 2019 04:49:08 +0000 (21:49 -0700)]
mm/migrate.c: initialize pud_entry in migrate_vma()
When CONFIG_MIGRATE_VMA_HELPER is enabled, migrate_vma() calls
migrate_vma_collect() which initializes a struct mm_walk but didn't
initialize mm_walk.pud_entry. (Found by code inspection) Use a C
structure initialization to make sure it is set to NULL.
Link: http://lkml.kernel.org/r/20190719233225.12243-1-rcampbell@nvidia.com
Fixes:
8763cb45ab967 ("mm/migrate: new memory migration helper for use with device memory")
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paul Wise [Sat, 3 Aug 2019 04:49:05 +0000 (21:49 -0700)]
coredump: split pipe command whitespace before expanding template
Save the offsets of the start of each argument to avoid having to update
pointers to each argument after every corename krealloc and to avoid
having to duplicate the memory for the dump command.
Executable names containing spaces were previously being expanded from
%e or %E and then split in the middle of the filename. This is
incorrect behaviour since an argument list can represent arguments with
spaces.
The splitting could lead to extra arguments being passed to the core
dump handler that it might have interpreted as options or ignored
completely.
Core dump handlers that are not aware of this Linux kernel issue will be
using %e or %E without considering that it may be split and so they will
be vulnerable to processes with spaces in their names breaking their
argument list. If their internals are otherwise well written, such as
if they are written in shell but quote arguments, they will work better
after this change than before. If they are not well written, then there
is a slight chance of breakage depending on the details of the code but
they will already be fairly broken by the split filenames.
Core dump handlers that are aware of this Linux kernel issue will be
placing %e or %E as the last item in their core_pattern and then
aggregating all of the remaining arguments into one, separated by
spaces. Alternatively they will be obtaining the filename via other
methods. Both of these will be compatible with the new arrangement.
A side effect from this change is that unknown template types (for
example %z) result in an empty argument to the dump handler instead of
the argument being dropped. This is a desired change as:
It is easier for dump handlers to process empty arguments than dropped
ones, especially if they are written in shell or don't pass each
template item with a preceding command-line option in order to
differentiate between individual template types. Most core_patterns in
the wild do not use options so they can confuse different template types
(especially numeric ones) if an earlier one gets dropped in old kernels.
If the kernel introduces a new template type and a core_pattern uses it,
the core dump handler might not expect that the argument can be dropped
in old kernels.
For example, this can result in security issues when %d is dropped in
old kernels. This happened with the corekeeper package in Debian and
resulted in the interface between corekeeper and Linux having to be
rewritten to use command-line options to differentiate between template
types.
The core_pattern for most core dump handlers is written by the handler
author who would generally not insert unknown template types so this
change should be compatible with all the core dump handlers that exist.
Link: http://lkml.kernel.org/r/20190528051142.24939-1-pabs3@bonedaddy.net
Fixes:
74aadce98605 ("core_pattern: allow passing of arguments to user mode helper when core_pattern is a pipe")
Signed-off-by: Paul Wise <pabs3@bonedaddy.net>
Reported-by: Jakub Wilk <jwilk@jwilk.net> [https://bugs.debian.org/924398]
Reported-by: Paul Wise <pabs3@bonedaddy.net> [https://lore.kernel.org/linux-fsdevel/c8b7ecb8508895bf4adb62a748e2ea2c71854597.camel@bonedaddy.net/]
Suggested-by: Jakub Wilk <jwilk@jwilk.net>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Arnd Bergmann [Sat, 3 Aug 2019 04:49:02 +0000 (21:49 -0700)]
page flags: prioritize kasan bits over last-cpuid
ARM64 randdconfig builds regularly run into a build error, especially
when NUMA_BALANCING and SPARSEMEM are enabled but not SPARSEMEM_VMEMMAP:
#error "KASAN: not enough bits in page flags for tag"
The last-cpuid bits are already contitional on the available space, so
the result of the calculation is a bit random on whether they were
already left out or not.
Adding the kasan tag bits before last-cpuid makes it much more likely to
end up with a successful build here, and should be reliable for
randconfig at least, as long as that does not randomize NR_CPUS or
NODES_SHIFT but uses the defaults.
In order for the modified check to not trigger in the x86 vdso32 code
where all constants are wrong (building with -m32), enclose all the
definitions with an #ifdef.
[arnd@arndb.de: build fix]
Link: http://lkml.kernel.org/r/CAK8P3a3Mno1SWTcuAOT0Wa9VS15pdU6EfnkxLbDpyS55yO04+g@mail.gmail.com
Link: http://lkml.kernel.org/r/20190722115520.3743282-1-arnd@arndb.de
Link: https://lore.kernel.org/lkml/20190618095347.3850490-1-arnd@arndb.de/
Fixes:
2813b9c02962 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Arnd Bergmann [Sat, 3 Aug 2019 04:48:58 +0000 (21:48 -0700)]
ubsan: build ubsan.c more conservatively
objtool points out several conditions that it does not like, depending
on the combination with other configuration options and compiler
variants:
stack protector:
lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch()+0xbf: call to __stack_chk_fail() with UACCESS enabled
lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch_v1()+0xbe: call to __stack_chk_fail() with UACCESS enabled
stackleak plugin:
lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch()+0x4a: call to stackleak_track_stack() with UACCESS enabled
lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch_v1()+0x4a: call to stackleak_track_stack() with UACCESS enabled
kasan:
lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch()+0x25: call to memcpy() with UACCESS enabled
lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch_v1()+0x25: call to memcpy() with UACCESS enabled
The stackleak and kasan options just need to be disabled for this file
as we do for other files already. For the stack protector, we already
attempt to disable it, but this fails on clang because the check is
mixed with the gcc specific -fno-conserve-stack option. According to
Andrey Ryabinin, that option is not even needed, dropping it here fixes
the stackprotector issue.
Link: http://lkml.kernel.org/r/20190722125139.1335385-1-arnd@arndb.de
Link: https://lore.kernel.org/lkml/20190617123109.667090-1-arnd@arndb.de/t/
Link: https://lore.kernel.org/lkml/20190722091050.2188664-1-arnd@arndb.de/t/
Fixes:
d08965a27e84 ("x86/uaccess, ubsan: Fix UBSAN vs. SMAP")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Arnd Bergmann [Sat, 3 Aug 2019 04:48:54 +0000 (21:48 -0700)]
kasan: remove clang version check for KASAN_STACK
asan-stack mode still uses dangerously large kernel stacks of tens of
kilobytes in some drivers, and it does not seem that anyone is working
on the clang bug.
Turn it off for all clang versions to prevent users from accidentally
enabling it once they update to clang-9, and to help automated build
testing with clang-9.
Link: https://bugs.llvm.org/show_bug.cgi?id=38809
Link: http://lkml.kernel.org/r/20190719200347.2596375-1-arnd@arndb.de
Fixes:
6baec880d7a5 ("kasan: turn off asan-stack for clang-8 and earlier")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Sat, 3 Aug 2019 04:48:51 +0000 (21:48 -0700)]
mm: compaction: avoid 100% CPU usage during compaction when a task is killed
"howaboutsynergy" reported via kernel buzilla number 204165 that
compact_zone_order was consuming 100% CPU during a stress test for
prolonged periods of time. Specifically the following command, which
should exit in 10 seconds, was taking an excessive time to finish while
the CPU was pegged at 100%.
stress -m 220 --vm-bytes
1000000000 --timeout 10
Tracing indicated a pattern as follows
stress-3923 [007] 519.106208: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106212: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106216: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106219: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106223: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106227: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106231: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106235: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106238: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106242: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
Note that compaction is entered in rapid succession while scanning and
isolating nothing. The problem is that when a task that is compacting
receives a fatal signal, it retries indefinitely instead of exiting
while making no progress as a fatal signal is pending.
It's not easy to trigger this condition although enabling zswap helps on
the basis that the timing is altered. A very small window has to be hit
for the problem to occur (signal delivered while compacting and
isolating a PFN for migration that is not aligned to SWAP_CLUSTER_MAX).
This was reproduced locally -- 16G single socket system, 8G swap, 30%
zswap configured, vm-bytes
22000000000 using Colin Kings stress-ng
implementation from github running in a loop until the problem hits).
Tracing recorded the problem occurring almost 200K times in a short
window. With this patch, the problem hit 4 times but the task existed
normally instead of consuming CPU.
This problem has existed for some time but it was made worse by commit
cf66f0700c8f ("mm, compaction: do not consider a need to reschedule as
contention"). Before that commit, if the same condition was hit then
locks would be quickly contended and compaction would exit that way.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204165
Link: http://lkml.kernel.org/r/20190718085708.GE24383@techsingularity.net
Fixes:
cf66f0700c8f ("mm, compaction: do not consider a need to reschedule as contention")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> [5.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Sat, 3 Aug 2019 04:48:47 +0000 (21:48 -0700)]
mm: migrate: fix reference check race between __find_get_block() and migration
buffer_migrate_page_norefs() can race with bh users in the following
way:
CPU1 CPU2
buffer_migrate_page_norefs()
buffer_migrate_lock_buffers()
checks bh refs
spin_unlock(&mapping->private_lock)
__find_get_block()
spin_lock(&mapping->private_lock)
grab bh ref
spin_unlock(&mapping->private_lock)
move page do bh work
This can result in various issues like lost updates to buffers (i.e.
metadata corruption) or use after free issues for the old page.
This patch closes the race by holding mapping->private_lock while the
mapping is being moved to a new page. Ordinarily, a reference can be
taken outside of the private_lock using the per-cpu BH LRU but the
references are checked and the LRU invalidated if necessary. The
private_lock is held once the references are known so the buffer lookup
slow path will spin on the private_lock. Between the page lock and
private_lock, it should be impossible for other references to be
acquired and updates to happen during the migration.
A user had reported data corruption issues on a distribution kernel with
a similar page migration implementation as mainline. The data
corruption could not be reproduced with this patch applied. A small
number of migration-intensive tests were run and no performance problems
were noted.
[mgorman@techsingularity.net: Changelog, removed tracing]
Link: http://lkml.kernel.org/r/20190718090238.GF24383@techsingularity.net
Fixes:
89cb0888ca14 "mm: migrate: provide buffer_migrate_page_norefs()"
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org> [5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Sat, 3 Aug 2019 04:48:44 +0000 (21:48 -0700)]
mm: vmscan: check if mem cgroup is disabled or not before calling memcg slab shrinker
Shakeel Butt reported premature oom on kernel with
"cgroup_disable=memory" since mem_cgroup_is_root() returns false even
though memcg is actually NULL. The drop_caches is also broken.
It is because commit
aeed1d325d42 ("mm/vmscan.c: generalize
shrink_slab() calls in shrink_node()") removed the !memcg check before
!mem_cgroup_is_root(). And, surprisingly root memcg is allocated even
though memory cgroup is disabled by kernel boot parameter.
Add mem_cgroup_disabled() check to make reclaimer work as expected.
Link: http://lkml.kernel.org/r/1563385526-20805-1-git-send-email-yang.shi@linux.alibaba.com
Fixes:
aeed1d325d42 ("mm/vmscan.c: generalize shrink_slab() calls in shrink_node()")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reported-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jan Hadrava <had@kam.mff.cuni.cz>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org> [4.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
YueHaibing [Sat, 3 Aug 2019 04:48:40 +0000 (21:48 -0700)]
ocfs2: remove set but not used variable 'last_hash'
Fixes gcc '-Wunused-but-set-variable' warning:
fs/ocfs2/xattr.c: In function ocfs2_xattr_bucket_find:
fs/ocfs2/xattr.c:3828:6: warning: variable last_hash set but not used [-Wunused-but-set-variable]
It's never used and can be removed.
Link: http://lkml.kernel.org/r/20190716132110.34836-1-yuehaibing@huawei.com
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Sat, 3 Aug 2019 04:48:37 +0000 (21:48 -0700)]
Revert "kmemleak: allow to coexist with fault injection"
When running ltp's oom test with kmemleak enabled, the below warning was
triggerred since kernel detects __GFP_NOFAIL & ~__GFP_DIRECT_RECLAIM is
passed in:
WARNING: CPU: 105 PID: 2138 at mm/page_alloc.c:4608 __alloc_pages_nodemask+0x1c31/0x1d50
Modules linked in: loop dax_pmem dax_pmem_core ip_tables x_tables xfs virtio_net net_failover virtio_blk failover ata_generic virtio_pci virtio_ring virtio libata
CPU: 105 PID: 2138 Comm: oom01 Not tainted 5.2.0-next-
20190710+ #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:__alloc_pages_nodemask+0x1c31/0x1d50
...
kmemleak_alloc+0x4e/0xb0
kmem_cache_alloc+0x2a7/0x3e0
mempool_alloc_slab+0x2d/0x40
mempool_alloc+0x118/0x2b0
bio_alloc_bioset+0x19d/0x350
get_swap_bio+0x80/0x230
__swap_writepage+0x5ff/0xb20
The mempool_alloc_slab() clears __GFP_DIRECT_RECLAIM, however kmemleak
has __GFP_NOFAIL set all the time due to
d9570ee3bd1d4f2 ("kmemleak:
allow to coexist with fault injection"). But, it doesn't make any sense
to have __GFP_NOFAIL and ~__GFP_DIRECT_RECLAIM specified at the same
time.
According to the discussion on the mailing list, the commit should be
reverted for short term solution. Catalin Marinas would follow up with
a better solution for longer term.
The failure rate of kmemleak metadata allocation may increase in some
circumstances, but this should be expected side effect.
Link: http://lkml.kernel.org/r/1563299431-111710-1-git-send-email-yang.shi@linux.alibaba.com
Fixes:
d9570ee3bd1d4f2 ("kmemleak: allow to coexist with fault injection")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mauro Carvalho Chehab [Sat, 3 Aug 2019 04:48:33 +0000 (21:48 -0700)]
kernel/signal.c: fix a kernel-doc markup
The kernel-doc parser doesn't handle expressions with %foo*. Instead,
when an asterisk should be part of a constant, it uses an alternative
notation: `foo*`.
Link: http://lkml.kernel.org/r/7f18c2e0b5e39e6b7eb55ddeb043b8b260b49f2d.1563361575.git.mchehab+samsung@kernel.org
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 3 Aug 2019 01:53:51 +0000 (18:53 -0700)]
Merge tag 'drm-fixes-2019-08-02-1' of git://anongit.freedesktop.org/drm/drm
Pull more drm fixes from Daniel Vetter:
"Dave sends his pull, everyone realizes they've been asleep at the
wheel and hits send on their own pulls :-/
Normally I'd just ignore these all because w/e for me and Dave. But
this time around the latecomers also included drm-intel-fixes, which
failed to send out a -fixes pull thus far for this release (screwed up
vacation coverage, despite that 2/3 maintainers were around ... they
all look appropriately guilty), and that really is overdue to get
landed.
And since I had to do a pull request anyway I pulled the other two
late ones too.
intel fixes (didn't have any ever since the main merge window pull):
- gvt fixes (2 cc: stable)
- fix gpu reset vs mm-shrinker vs wakeup fun (needed a few patches)
- two gem locking fixes (one cc: stable)
- pile of misc fixes all over with minor impact, 6 cc: stable, others
from this window
exynos:
- misc minor fixes
misc:
- some build/Kconfig fixes
- regression fix for vm scalability perf test which seems to mostly
exercise dmesg/console logging ...
- the vgem cache flush fix for arm64 broke the world on x86, so
that's reverted again
* tag 'drm-fixes-2019-08-02-1' of git://anongit.freedesktop.org/drm/drm: (42 commits)
Revert "drm/vgem: fix cache synchronization on arm/arm64"
drm/exynos: fix missing decrement of retry counter
drm/exynos: add CONFIG_MMU dependency
drm/exynos: remove redundant assignment to pointer 'node'
drm/exynos: using dev_get_drvdata directly
drm/bochs: Use shadow buffer for bochs framebuffer console
drm/fb-helper: Instanciate shadow FB if configured in device's mode_config
drm/fb-helper: Map DRM client buffer only when required
drm/client: Support unmapping of DRM client buffers
drm/i915: Only recover active engines
drm/i915: Add a wakeref getter for iff the wakeref is already active
drm/i915: Lift intel_engines_resume() to callers
drm/vgem: fix cache synchronization on arm/arm64
drm/i810: Use CONFIG_PREEMPTION
drm/bridge: tc358764: Fix build error
drm/bridge: lvds-encoder: Fix build error while CONFIG_DRM_KMS_HELPER=m
drm/i915/gvt: Adding ppgtt to GVT GEM context after shadow pdps settled.
drm/i915/gvt: grab runtime pm first for forcewake use
drm/i915/gvt: fix incorrect cache entry for guest page mapping
drm/i915/gvt: Checking workload's gma earlier
...
Linus Torvalds [Sat, 3 Aug 2019 01:40:49 +0000 (18:40 -0700)]
Merge tag 'selinux-pr-
20190801' of git://git./linux/kernel/git/pcmoore/selinux
Pull selinux fix from Paul Moore:
"One more small fix for a potential memory leak in an error path"
* tag 'selinux-pr-
20190801' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: fix memory leak in policydb_init()
Jean Delvare [Wed, 31 Jul 2019 08:07:06 +0000 (10:07 +0200)]
mtd: hyperbus: Add hardware dependency to AM654 driver
The hbmc-am654 driver is for the TI AM654, which is an ARM64 SoC, so
don't propose this driver on other architectures unless
build-testing.
Fixes:
b07079f1642c ("mtd: hyperbus: Add driver for TI's HyperBus memory controller")
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Vignesh Raghavendra [Fri, 19 Jul 2019 08:29:12 +0000 (13:59 +0530)]
mtd: hyperbus: Kconfig: Fix HBMC_AM654 dependencies
On x86_64, when CONFIG_OF is not disabled:
WARNING: unmet direct dependencies detected for MUX_MMIO
Depends on [n]: MULTIPLEXER [=y] && (OF [=n] || COMPILE_TEST [=n])
Selected by [y]:
- HBMC_AM654 [=y] && MTD [=y] && MTD_HYPERBUS [=y]
due to
config HBMC_AM654
tristate "HyperBus controller driver for AM65x SoC"
select MULTIPLEXER
select MUX_MMIO
Fix this by making HBMC_AM654 imply MUX_MMIO instead of select so
that dependencies are taken care of. MUX_MMIO is optional for
functioning of driver.
Fixes:
b07079f1642c ("mtd: hyperbus: Add driver for TI's HyperBus memory controller")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Marco Felsch [Tue, 30 Jul 2019 13:44:07 +0000 (15:44 +0200)]
mtd: rawnand: micron: handle on-die "ECC-off" devices correctly
Some devices are not supposed to support on-die ECC but experience
shows that internal ECC machinery can actually be enabled through the
"SET FEATURE (EFh)" command, even if a read of the "READ ID Parameter
Tables" returns that it is not.
Currently, the driver checks the "READ ID Parameter" field directly
after having enabled the feature. If the check fails it returns
immediately but leaves the ECC on. When using buggy chips like
MT29F2G08ABAGA and MT29F2G08ABBGA, all future read/program cycles will
go through the on-die ECC, confusing the host controller which is
supposed to be the one handling correction.
To address this in a common way we need to turn off the on-die ECC
directly after reading the "READ ID Parameter" and before checking the
"ECC status".
Cc: stable@vger.kernel.org
Fixes:
dbc44edbf833 ("mtd: rawnand: micron: Fix on-die ECC detection logic")
Signed-off-by: Marco Felsch <m.felsch@pengutronix.de>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Linus Torvalds [Fri, 2 Aug 2019 22:26:48 +0000 (15:26 -0700)]
Merge tag 'for-linus-5.3a-rc3-tag' of git://git./linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- a small cleanup
- a fix for a build error on ARM with some configs
- a fix of a patch for the Xen gntdev driver
- three patches for fixing a potential problem in the swiotlb-xen
driver which Konrad was fine with me carrying them through the Xen
tree
* tag 'for-linus-5.3a-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/swiotlb: remember having called xen_create_contiguous_region()
xen/swiotlb: simplify range_straddles_page_boundary()
xen/swiotlb: fix condition for calling xen_destroy_contiguous_region()
xen: avoid link error on ARM
xen/gntdev.c: Replace vm_map_pages() with vm_map_pages_zero()
xen/pciback: remove set but not used variable 'old_state'
Linus Torvalds [Fri, 2 Aug 2019 22:23:27 +0000 (15:23 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Update the compat layer to allow single-byte watchpoints on all
addresses (similar to the native support)
- arm_pmu: fix the restoration of the counters on the
CPU_PM_ENTER_FAILED path
- Fix build regression with vDSO and Makefile not stripping
CROSS_COMPILE_COMPAT
- Fix the CTR_EL0 (cache type register) sanitisation on heterogeneous
machines (e.g. big.LITTLE)
- Fix the interrupt controller priority mask value when pseudo-NMIs are
enabled
- arm64 kprobes fixes: recovering of the PSTATE.D flag in the
single-step exception handler, NOKPROBE annotations for
unwind_frame() and walk_stackframe(), remove unneeded
rcu_read_lock/unlock from debug handlers
- Several gcc fall-through warnings
- Unused variable warnings
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: Make debug exception handlers visible from RCU
arm64: kprobes: Recover pstate.D in single-step exception handler
arm64/mm: fix variable 'tag' set but not used
arm64/mm: fix variable 'pud' set but not used
arm64: Remove unneeded rcu_read_lock from debug handlers
arm64: unwind: Prohibit probing on return_address()
arm64: Lower priority mask for GIC_PRIO_IRQON
arm64/efi: fix variable 'si' set but not used
arm64: cpufeature: Fix feature comparison for CTR_EL0.{CWG,ERG}
arm64: vdso: Fix Makefile regression
arm64: module: Mark expected switch fall-through
arm64: smp: Mark expected switch fall-through
arm64: hw_breakpoint: Fix warnings about implicit fallthrough
drivers/perf: arm_pmu: Fix failure path in PM notifier
arm64: compat: Allow single-byte watchpoints on all addresses
Linus Torvalds [Fri, 2 Aug 2019 22:18:51 +0000 (15:18 -0700)]
Merge branch 'parisc-5.3-4' of git://git./linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
"A few small fixes for the parisc architecture:
- Fix fall-through warnings in parisc math emu code
- Fix vmlinuz linking failure with debug-enabled kernels
- Fix a race condition in kernel live-patching code
- Add missing archclean Makefile target & defconfig adjustments"
* 'parisc-5.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Add archclean Makefile target
parisc: Strip debug info from kernel before creating compressed vmlinuz
parisc: Fix build of compressed kernel even with debug enabled
parisc: fix race condition in patching code
parisc: rename default_defconfig to defconfig
parisc: Fix fall-through warnings in fpudispatch.c
parisc: Mark expected switch fall-throughs in fault.c