platform/kernel/linux-rpi.git
3 years agonvmet: remove extra variable in smart log nsid
Chaitanya Kulkarni [Thu, 14 Jan 2021 01:33:52 +0000 (17:33 -0800)]
nvmet: remove extra variable in smart log nsid

We remove the extra local variable struct nvmet_ns in
nvmet_get_smart_log_nsid() since req already has ns member that can be
reused, this also eliminates the explicit call to nvmet_put_namespace()
which is already present in the request completion path.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: refactor ns->ctrl by request
Minwoo Im [Wed, 13 Jan 2021 14:36:27 +0000 (23:36 +0900)]
nvme: refactor ns->ctrl by request

Just for current code in nvme_cleanup_cmd(), we don't have to get
namespace instance, but we need controller instance.

Controller instance can be retrieved by namespace instance, but it can
be directly accessed by nvme_request instance from request.

ctrl = nvme_req(req)->ctrl;

We don't have to go around namespace instance from request instance
through gendisk.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme-tcp: pass multipage bvec to request iov_iter
Sagi Grimberg [Thu, 14 Jan 2021 21:15:26 +0000 (13:15 -0800)]
nvme-tcp: pass multipage bvec to request iov_iter

iov_iter uses the right helpers so we should be able
to pass in a multipage bvec. Right now the iov_iter is
initialized with more segments that it needs which doesn't
fail because the iov_iter is capped by byte count, but it
is better to use a full multipage bvec iter.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme-tcp: get rid of unused helper function
Sagi Grimberg [Thu, 14 Jan 2021 21:15:25 +0000 (13:15 -0800)]
nvme-tcp: get rid of unused helper function

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme-tcp: fix wrong setting of request iov_iter
Sagi Grimberg [Thu, 14 Jan 2021 21:15:24 +0000 (13:15 -0800)]
nvme-tcp: fix wrong setting of request iov_iter

We might set the iov_iter direction wrong, which is harmless for this
use-case, but get it right. Also this makes the code slightly cleaner.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: support command retry delay for admin command
Minwoo Im [Fri, 8 Jan 2021 14:46:57 +0000 (23:46 +0900)]
nvme: support command retry delay for admin command

The controller can request a delay retrying a failed command by setting
the Command Retry Delay (CRD) field in the Completion Queue Entry.

Currentlty this features is only applied to commands on the I/O queue, but
not to commands on the admin queue.  Retreive the nvme_ctrl from the
request so that no namespace is required and apply the feature to all
commands.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: constify static attribute_group structs
Rikard Falkeborn [Fri, 8 Jan 2021 23:41:47 +0000 (00:41 +0100)]
nvme: constify static attribute_group structs

The only usage of these is to put their addresses in arrays of pointers
to const attribute_groups. Make them const to allow the compiler to put
them in read-only memory.

Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet-fc: use RCU proctection for assoc_list
Leonid Ravich [Sun, 3 Jan 2021 18:12:54 +0000 (20:12 +0200)]
nvmet-fc: use RCU proctection for assoc_list

searching assoc_list protected by rcu_read_lock if list not changed inline.
and according to the rcu list rules.

queue array embedded into nvmet_fc_tgt_assoc protected by rcu_read_lock
according to rcu dereference/assign rules.

queue and assoc object freed after grace period by call_rcu.

tgtport lock taken for changing assoc_list.

Reviewed-by: Eldad Zinger <Eldad.Zinger@dell.com>
Reviewed-by: Elad Grupi <Elad.Grupi@dell.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Leonid Ravich <Leonid.Ravich@emc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet: Fix nvmet_is_port_enabled indentation
Israel Rukshin [Thu, 7 Jan 2021 15:34:14 +0000 (17:34 +0200)]
nvmet: Fix nvmet_is_port_enabled indentation

Remove extra tab.

Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet: Use nvmet_is_port_enabled helper for pi_enable
Israel Rukshin [Thu, 7 Jan 2021 15:34:13 +0000 (17:34 +0200)]
nvmet: Use nvmet_is_port_enabled helper for pi_enable

Remove code duplication.

Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agodrbd: Avoid comma separated statements
Joe Perches [Tue, 25 Aug 2020 04:56:03 +0000 (21:56 -0700)]
drbd: Avoid comma separated statements

Use semicolons and braces.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agorsxx: remove redundant NULL check
Yang Li [Thu, 21 Jan 2021 09:43:22 +0000 (17:43 +0800)]
rsxx: remove redundant NULL check

Fix below warnings reported by coccicheck:
./drivers/block/rsxx/dma.c:948:3-8: WARNING: NULL check
before some freeing functions is not needed.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <abaci-bugfix@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agozram: fix NULL check before some freeing functions is not needed
Tian Tao [Mon, 25 Jan 2021 08:13:01 +0000 (16:13 +0800)]
zram: fix NULL check before some freeing functions is not needed

fixed the below warning:
/drivers/block/zram/zram_drv.c:534:2-8: WARNING: NULL check
before some freeing functions is not needed.

Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agodrbd: remove unused argument from drbd_request_prepare and __drbd_make_request
Guoqing Jiang [Thu, 21 Jan 2021 14:21:50 +0000 (15:21 +0100)]
drbd: remove unused argument from drbd_request_prepare and __drbd_make_request

We can remove start_jif since it is not used by drbd_request_prepare,
then remove it from __drbd_make_request further.

Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: drbd-dev@lists.linbit.com
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomtip32xx: prefer pcie_capability_read_word()
Bjorn Helgaas [Tue, 26 Jan 2021 20:04:33 +0000 (14:04 -0600)]
mtip32xx: prefer pcie_capability_read_word()

Replace pci_read_config_word() with pcie_capability_read_word().

pcie_capability_read_word() takes care of a few special cases when reading
the PCIe capability.  See 8c0d3a02c130 ("PCI: Add accessors for PCI Express
Capability").

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomtip32xx: use PCI #defines instead of numbers
Bjorn Helgaas [Tue, 26 Jan 2021 20:04:32 +0000 (14:04 -0600)]
mtip32xx: use PCI #defines instead of numbers

Use PCI #defines for PCIe Device Control register values instead of
hard-coding bit positions.  No functional change intended.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoloop: scale loop device by introducing per device lock
Pavel Tatashin [Tue, 26 Jan 2021 14:46:30 +0000 (09:46 -0500)]
loop: scale loop device by introducing per device lock

Currently, loop device has only one global lock: loop_ctl_mutex.

This becomes hot in scenarios where many loop devices are used.

Scale it by introducing per-device lock: lo_mutex that protects
modifications of all fields in struct loop_device.

Keep loop_ctl_mutex to protect global data: loop_index_idr, loop_lookup,
loop_add.

The new lock ordering requirement is that loop_ctl_mutex must be taken
before lo_mutex.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agobdev: Do not return EBUSY if bdev discard races with write
Jan Kara [Thu, 7 Jan 2021 15:40:34 +0000 (16:40 +0100)]
bdev: Do not return EBUSY if bdev discard races with write

blkdev_fallocate() tries to detect whether a discard raced with an
overlapping write by calling invalidate_inode_pages2_range(). However
this check can give both false negatives (when writing using direct IO
or when writeback already writes out the written pagecache range) and
false positives (when write is not actually overlapping but ends in the
same page when blocksize < pagesize). This actually causes issues for
qemu which is getting confused by EBUSY errors.

Fix the problem by removing this conflicting write detection since it is
inherently racy and thus of little use anyway.

Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
CC: "Darrick J. Wong" <darrick.wong@oracle.com>
Link: https://lore.kernel.org/qemu-devel/20201111153913.41840-1-mlevitsk@redhat.com
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: inherit BIO_REMAPPED when cloning bios
Christoph Hellwig [Tue, 26 Jan 2021 14:33:08 +0000 (15:33 +0100)]
block: inherit BIO_REMAPPED when cloning bios

Cloned bios are can be used to on the same device, in which case we need
to inherit the BIO_REMAPPED flag to avoid a double partition remap.  When
the cloned bios are used on another device, bio_set_dev will clear the flag.

Fixes: 309dca309fc3 ("block: store a block_device pointer in struct bio")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agobcache: use bio_set_dev to assign ->bi_bdev
Christoph Hellwig [Tue, 26 Jan 2021 14:33:07 +0000 (15:33 +0100)]
bcache: use bio_set_dev to assign ->bi_bdev

Always use the bio_set_dev helper to assign ->bi_bdev to make sure
other state related to the device is uptodate.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonvme: use bio_set_dev to assign ->bi_bdev
Christoph Hellwig [Tue, 26 Jan 2021 14:33:06 +0000 (15:33 +0100)]
nvme: use bio_set_dev to assign ->bi_bdev

Always use the bio_set_dev helper to assign ->bi_bdev to make sure
other state related to the device is uptodate.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agobfq: bfq_check_waker() should be static
Jens Axboe [Tue, 26 Jan 2021 04:15:01 +0000 (21:15 -0700)]
bfq: bfq_check_waker() should be static

It's only used in the same file, mark is appropriately static.

Fixes: 71217df39dc6 ("block, bfq: make waker-queue detection more robust")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock, bfq: make waker-queue detection more robust
Paolo Valente [Mon, 25 Jan 2021 19:02:48 +0000 (20:02 +0100)]
block, bfq: make waker-queue detection more robust

In the presence of many parallel I/O flows, the detection of waker
bfq_queues suffers from false positives. This commits addresses this
issue by making the filtering of actual wakers more selective. In more
detail, a candidate waker must be found to meet waker requirements
three times before being promoted to actual waker.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock, bfq: save also injection state on queue merging
Paolo Valente [Mon, 25 Jan 2021 19:02:47 +0000 (20:02 +0100)]
block, bfq: save also injection state on queue merging

To prevent injection information from being lost on bfq_queue merging,
also the amount of service that a bfq_queue receives must be saved and
restored when the bfq_queue is merged and split, respectively.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock, bfq: save also weight-raised service on queue merging
Paolo Valente [Mon, 25 Jan 2021 19:02:46 +0000 (20:02 +0100)]
block, bfq: save also weight-raised service on queue merging

To prevent weight-raising information from being lost on bfq_queue merging,
also the amount of service that a bfq_queue receives must be saved and
restored when the bfq_queue is merged and split, respectively.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock, bfq: fix switch back from soft-rt weitgh-raising
Paolo Valente [Mon, 25 Jan 2021 19:02:45 +0000 (20:02 +0100)]
block, bfq: fix switch back from soft-rt weitgh-raising

A bfq_queue may happen to be deemed as soft real-time while it is
still enjoying interactive weight-raising. If this happens because of
a false positive, then the bfq_queue is likely to loose its soft
real-time status soon. Upon losing such a status, the bfq_queue must
get back its interactive weight-raising, if its interactive period is
not over yet. But this case is not handled. This commit corrects this
error.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock, bfq: re-evaluate convenience of I/O plugging on rq arrivals
Paolo Valente [Mon, 25 Jan 2021 19:02:44 +0000 (20:02 +0100)]
block, bfq: re-evaluate convenience of I/O plugging on rq arrivals

Upon an I/O-dispatch attempt, BFQ may detect that it was better to
plug I/O dispatch, and to wait for a new request to arrive for the
currently in-service queue. But the arrival of a new request for an
empty bfq_queue, and thus the switch from idle to busy of the
bfq_queue, may cause the scenario to change, and make plugging no
longer needed for service guarantees, or more convenient for
throughput. In this case, keeping I/O-dispatch plugged would certainly
lower throughput.

To address this issue, this commit makes such a check, and stops
plugging I/O if it is better to stop plugging I/O.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock, bfq: replace mechanism for evaluating I/O intensity
Paolo Valente [Mon, 25 Jan 2021 19:02:43 +0000 (20:02 +0100)]
block, bfq: replace mechanism for evaluating I/O intensity

Some BFQ mechanisms make their decisions on a bfq_queue basing also on
whether the bfq_queue is I/O bound. In this respect, the current logic
for evaluating whether a bfq_queue is I/O bound is rather rough. This
commits replaces this logic with a more effective one.

The new logic measures the percentage of time during which a bfq_queue
is active, and marks the bfq_queue as I/O bound if the latter if this
percentage is above a fixed threshold.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: skip bio_check_eod for partition-remapped bios
Christoph Hellwig [Mon, 25 Jan 2021 18:39:57 +0000 (19:39 +0100)]
block: skip bio_check_eod for partition-remapped bios

When an already remapped bio is resubmitted (e.g. by blk_queue_split),
bio_check_eod will compare the remapped bi_sector against the size
of the partition, leading to spurious I/O failures.

Skip the EOD check in this case.

Fixes: 309dca309fc3 ("block: store a block_device pointer in struct bio")
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agobio: don't copy bvec for direct IO
Pavel Begunkov [Sat, 9 Jan 2021 16:03:03 +0000 (16:03 +0000)]
bio: don't copy bvec for direct IO

The block layer spends quite a while in blkdev_direct_IO() to copy and
initialise bio's bvec. However, if we've already got a bvec in the input
iterator it might be reused in some cases, i.e. when new
ITER_BVEC_FLAG_FIXED flag is set. Simple tests show considerable
performance boost, and it also reduces memory footprint.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agobio: add a helper calculating nr segments to alloc
Pavel Begunkov [Sat, 9 Jan 2021 16:03:02 +0000 (16:03 +0000)]
bio: add a helper calculating nr segments to alloc

Add a helper function calculating the number of bvec segments we need to
allocate to construct a bio. It doesn't change anything functionally,
but will be used to not duplicate special cases in the future.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoiov_iter: optimise bvec iov_iter_advance()
Pavel Begunkov [Sat, 9 Jan 2021 16:03:01 +0000 (16:03 +0000)]
iov_iter: optimise bvec iov_iter_advance()

iov_iter_advance() is heavily used, but implemented through generic
means. For bvecs there is a specifically crafted function for that, so
use bvec_iter_advance() instead, it's faster and slimmer.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agotarget/file: allocate the bvec array as part of struct target_core_file_cmd
Christoph Hellwig [Sat, 9 Jan 2021 16:03:00 +0000 (16:03 +0000)]
target/file: allocate the bvec array as part of struct target_core_file_cmd

This saves one memory allocation, and ensures the bvecs aren't freed
before the AIO completion.  This will allow the lower level code to be
optimized so that it can avoid allocating another bvec array.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock/psi: remove PSI annotations from direct IO
Pavel Begunkov [Sat, 9 Jan 2021 16:02:59 +0000 (16:02 +0000)]
block/psi: remove PSI annotations from direct IO

Direct IO does not operate on the current working set of pages managed
by the kernel, so it should not be accounted as memory stall to PSI
infrastructure.

The block layer and iomap direct IO use bio_iov_iter_get_pages()
to build bios, and they are the only users of it, so to avoid PSI
tracking for them clear out BIO_WORKINGSET flag. Do same for
dio_bio_submit() because fs/direct_io constructs bios by hand directly
calling bio_add_page().

Reported-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agobvec/iter: disallow zero-length segment bvecs
Pavel Begunkov [Sat, 9 Jan 2021 16:02:58 +0000 (16:02 +0000)]
bvec/iter: disallow zero-length segment bvecs

zero-length bvec segments are allowed in general, but not handled by bio
and down the block layer so filtered out. This inconsistency may be
confusing and prevent from optimisations. As zero-length segments are
useless and places that were generating them are patched, declare them
not allowed.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agosplice: don't generate zero-len segement bvecs
Pavel Begunkov [Sat, 9 Jan 2021 16:02:57 +0000 (16:02 +0000)]
splice: don't generate zero-len segement bvecs

iter_file_splice_write() may spawn bvec segments with zero-length. In
preparation for prohibiting them, filter out by hand at splice level.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove unnecessary argument from blk_execute_rq
Guoqing Jiang [Mon, 25 Jan 2021 04:49:58 +0000 (05:49 +0100)]
block: remove unnecessary argument from blk_execute_rq

We can remove 'q' from blk_execute_rq as well after the previous change
in blk_execute_rq_nowait.

And more importantly it never really was needed to start with given
that we can trivial derive it from struct request.

Cc: linux-scsi@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: linux-ide@vger.kernel.org
Cc: linux-mmc@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: linux-nfs@vger.kernel.org
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove unnecessary argument from blk_execute_rq_nowait
Guoqing Jiang [Mon, 25 Jan 2021 04:49:57 +0000 (05:49 +0100)]
block: remove unnecessary argument from blk_execute_rq_nowait

The 'q' is not used since commit a1ce35fa4985 ("block: remove dead
elevator code"), also update the comment of the function.

And more importantly it never really was needed to start with given
that we can trivial derive it from struct request.

Cc: target-devel@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: linux-ide@vger.kernel.org
Cc: linux-mmc@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agobsg: free the request before return error code
Pan Bian [Tue, 19 Jan 2021 12:33:11 +0000 (04:33 -0800)]
bsg: free the request before return error code

Free the request rq before returning error code.

Fixes: 972248e9111e ("scsi: bsg-lib: handle bidi requests without block layer help")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agobcache: don't pass BIOSET_NEED_BVECS for the 'bio_set' embedded in 'cache_set'
Ming Lei [Mon, 11 Jan 2021 03:05:57 +0000 (11:05 +0800)]
bcache: don't pass BIOSET_NEED_BVECS for the 'bio_set' embedded in 'cache_set'

This bioset is just for allocating bio only from bio_next_split, and it
needn't bvecs, so remove the flag.

Cc: linux-bcache@vger.kernel.org
Cc: Coly Li <colyli@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: move three bvec helpers declaration into private helper
Ming Lei [Mon, 11 Jan 2021 03:05:56 +0000 (11:05 +0800)]
block: move three bvec helpers declaration into private helper

bvec_alloc(), bvec_free() and bvec_nr_vecs() are only used inside block
layer core functions, no need to declare them in public header.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: set .bi_max_vecs as actual allocated vector number
Ming Lei [Mon, 11 Jan 2021 03:05:55 +0000 (11:05 +0800)]
block: set .bi_max_vecs as actual allocated vector number

bvec_alloc() may allocate more bio vectors than requested, so set
.bi_max_vecs as actual allocated vector number, instead of the requested
number. This way can help fs build bigger bio because new bio often won't
be allocated until the current one becomes full.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: don't allocate inline bvecs if this bioset needn't bvecs
Ming Lei [Mon, 11 Jan 2021 03:05:54 +0000 (11:05 +0800)]
block: don't allocate inline bvecs if this bioset needn't bvecs

The inline bvecs won't be used if user needn't bvecs by not passing
BIOSET_NEED_BVECS, so don't allocate bvecs in this situation.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: don't pass BIOSET_NEED_BVECS for q->bio_split
Ming Lei [Mon, 11 Jan 2021 03:05:53 +0000 (11:05 +0800)]
block: don't pass BIOSET_NEED_BVECS for q->bio_split

q->bio_split is only used by bio_split() for fast cloning bio, and no
need to allocate bvecs, so remove this flag.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: manage bio slab cache by xarray
Ming Lei [Mon, 11 Jan 2021 03:05:52 +0000 (11:05 +0800)]
block: manage bio slab cache by xarray

Managing bio slab cache via xarray by using slab cache size as xarray
index, and storing 'struct bio_slab' instance into xarray.

So code is simplified a lot, meantime it becomes more readable than before.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agobfq: don't duplicate code for different paths
huhai [Fri, 25 Dec 2020 13:00:16 +0000 (21:00 +0800)]
bfq: don't duplicate code for different paths

As we can see, returns parent_sched_may_change whether
sd->next_in_service changes or not, so remove this judgment.

Signed-off-by: huhai <huhai@tj.kylinos.cn>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-mq: Improve performance of non-mq IO schedulers with multiple HW queues
Jan Kara [Mon, 11 Jan 2021 16:47:17 +0000 (17:47 +0100)]
blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues

Currently when non-mq aware IO scheduler (BFQ, mq-deadline) is used for
a queue with multiple HW queues, the performance it rather bad. The
problem is that these IO schedulers use queue-wide locking and their
dispatch function does not respect the hctx it is passed in and returns
any request it finds appropriate. Thus locality of request access is
broken and dispatch from multiple CPUs just contends on IO scheduler
locks. For these IO schedulers there's little point in dispatching from
multiple CPUs. Instead dispatch always only from a single CPU to limit
contention.

Below is a comparison of dbench runs on XFS filesystem where the storage
is a raid card with 64 HW queues and to it attached a single rotating
disk. BFQ is used as IO scheduler:

      clients           MQ                     SQ             MQ-Patched
Amean 1      39.12 (0.00%)       43.29 * -10.67%*       36.09 *   7.74%*
Amean 2     128.58 (0.00%)      101.30 *  21.22%*       96.14 *  25.23%*
Amean 4     577.42 (0.00%)      494.47 *  14.37%*      508.49 *  11.94%*
Amean 8     610.95 (0.00%)      363.86 *  40.44%*      362.12 *  40.73%*
Amean 16    391.78 (0.00%)      261.49 *  33.25%*      282.94 *  27.78%*
Amean 32    324.64 (0.00%)      267.71 *  17.54%*      233.00 *  28.23%*
Amean 64    295.04 (0.00%)      253.02 *  14.24%*      242.37 *  17.85%*
Amean 512 10281.61 (0.00%)    10211.16 *   0.69%*    10447.53 *  -1.61%*

Numbers are times so lower is better. MQ is stock 5.10-rc6 kernel. SQ is
the same kernel with megaraid_sas.host_tagset_enable=0 so that the card
advertises just a single HW queue. MQ-Patched is a kernel with this
patch applied.

You can see multiple hardware queues heavily hurt performance in
combination with BFQ. The patch restores the performance.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoRevert "blk-mq, elevator: Count requests per hctx to improve performance"
Jan Kara [Mon, 11 Jan 2021 16:47:16 +0000 (17:47 +0100)]
Revert "blk-mq, elevator: Count requests per hctx to improve performance"

This reverts commit b445547ec1bbd3e7bf4b1c142550942f70527d95.

Since both mq-deadline and BFQ completely ignore hctx they are passed to
their dispatch function and dispatch whatever request they deem fit
checking whether any request for a particular hctx is queued is just
pointless since we'll very likely get a request from a different hctx
anyway. In the following commit we'll deal with lock contention in these
IO schedulers in presence of multiple HW queues in a different way.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock, bfq: do not expire a queue when it is the only busy one
Paolo Valente [Fri, 22 Jan 2021 18:19:48 +0000 (19:19 +0100)]
block, bfq: do not expire a queue when it is the only busy one

This commits preserves I/O-dispatch plugging for a special symmetric
case that may suddenly turn into asymmetric: the case where only one
bfq_queue, say bfqq, is busy. In this case, not expiring bfqq does not
cause any harm to any other queues in terms of service guarantees. In
contrast, it avoids the following unlucky sequence of events: (1) bfqq
is expired, (2) a new queue with a lower weight than bfqq becomes busy
(or more queues), (3) the new queue is served until a new request
arrives for bfqq, (4) when bfqq is finally served, there are so many
requests of the new queue in the drive that the pending requests for
bfqq take a lot of time to be served. In particular, event (2) may
case even already dispatched requests of bfqq to be delayed, inside
the drive. So, to avoid this series of events, the scenario is
preventively declared as asymmetric also if bfqq is the only busy
queues. By doing so, I/O-dispatch plugging is performed for bfqq.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock, bfq: avoid spurious switches to soft_rt of interactive queues
Paolo Valente [Fri, 22 Jan 2021 18:19:47 +0000 (19:19 +0100)]
block, bfq: avoid spurious switches to soft_rt of interactive queues

BFQ tags some bfq_queues as interactive or soft_rt if it deems that
these bfq_queues contain the I/O of, respectively, interactive or soft
real-time applications. BFQ privileges both these special types of
bfq_queues over normal bfq_queues. To privilege a bfq_queue, BFQ
mainly raises the weight of the bfq_queue. In particular, soft_rt
bfq_queues get a higher weight than interactive bfq_queues.

A bfq_queue may turn from interactive to soft_rt. And this leads to a
tricky issue. Soft real-time applications usually start with an
I/O-bound, interactive phase, in which they load themselves into main
memory. BFQ correctly detects this phase, and keeps the bfq_queues
associated with the application in interactive mode for a
while. Problems arise when the I/O pattern of the application finally
switches to soft real-time. One of the conditions for a bfq_queue to
be deemed as soft_rt is that the bfq_queue does not consume too much
bandwidth. But the bfq_queues associated with a soft real-time
application consume as much bandwidth as they can in the loading phase
of the application. So, after the application becomes truly soft
real-time, a lot of time should pass before the average bandwidth
consumed by its bfq_queues finally drops to a value acceptable for
soft_rt bfq_queues. As a consequence, there might be a time gap during
which the application is not privileged at all, because its bfq_queues
are not interactive any longer, but cannot be deemed as soft_rt yet.

To avoid this problem, BFQ pretends that an interactive bfq_queue
consumes zero bandwidth, and allows an interactive bfq_queue to switch
to soft_rt. Yet, this fake zero-bandwidth consumption easily causes
the bfq_queue to often switch to soft_rt deceptively, during its
loading phase. As in soft_rt mode, the bfq_queue gets its bandwidth
correctly computed, and therefore soon switches back to
interactive. Then it switches again to soft_rt, and so on. These
spurious fluctuations usually cause losses of throughput, because they
deceive BFQ's mechanisms for boosting throughput (injection,
I/O-plugging avoidance, ...).

This commit addresses this issue as follows:
1) It does compute actual bandwidth consumption also for interactive
   bfq_queues. This avoids the above false positives.
2) When a bfq_queue switches from interactive to normal mode, the
   consumed bandwidth is reset (forgotten). This allows the
   bfq_queue to enjoy soft_rt very quickly. In particular, two
   alternatives are possible in this switch:
    - the bfq_queue still has backlog, and therefore there is a budget
      already scheduled to serve the bfq_queue; in this case, the
      scheduling of the current budget of the bfq_queue is not
      hindered, because only the scheduling of the next budget will
      be affected by the weight drop. After that, if the bfq_queue is
      actually in a soft_rt phase, and becomes empty during the
      service of its current budget, which is the natural behavior of
      a soft_rt bfq_queue, then the bfq_queue will be considered as
      soft_rt when its next I/O arrives. If, in contrast, the
      bfq_queue remains constantly non-empty, then its next budget
      will be scheduled with a low weight, which is the natural
      treatment for an I/O-bound (non soft_rt) bfq_queue.
    - the bfq_queue is empty; in this case, the bfq_queue may be
      considered unjustly soft_rt when its new I/O arrives. Yet
      the problem is now much smaller than before, because it is
      unlikely that more than one spurious fluctuation occurs.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock, bfq: do not raise non-default weights
Paolo Valente [Fri, 22 Jan 2021 18:19:46 +0000 (19:19 +0100)]
block, bfq: do not raise non-default weights

BFQ heuristics try to detect interactive I/O, and raise the weight of
the queues containing such an I/O. Yet, if also the user changes the
weight of a queue (i.e., the user changes the ioprio of the process
associated with that queue), then it is most likely better to prevent
BFQ heuristics from silently changing the same weight.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock, bfq: increase time window for waker detection
Paolo Valente [Fri, 22 Jan 2021 18:19:45 +0000 (19:19 +0100)]
block, bfq: increase time window for waker detection

Tests on slower machines showed current window to be way too
small. This commit increases it.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock, bfq: set next_rq to waker_bfqq->next_rq in waker injection
Jia Cheng Hu [Fri, 22 Jan 2021 18:19:44 +0000 (19:19 +0100)]
block, bfq: set next_rq to waker_bfqq->next_rq in waker injection

Since commit c5089591c3ba ("block, bfq: detect wakers and
unconditionally inject their I/O"), when the in-service bfq_queue, say
Q, is temporarily empty, BFQ checks whether there are I/O requests to
inject (also) from the waker bfq_queue for Q. To this goal, the value
pointed by bfqq->waker_bfqq->next_rq must be controlled. However, the
current implementation mistakenly looks at bfqq->next_rq, which
instead points to the next request of the currently served queue.

This mistake evidently causes losses of throughput in scenarios with
waker bfq_queues.

This commit corrects this mistake.

Fixes: c5089591c3ba ("block, bfq: detect wakers and unconditionally inject their I/O")
Signed-off-by: Jia Cheng Hu <jia.jiachenghu@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock, bfq: use half slice_idle as a threshold to check short ttime
Paolo Valente [Fri, 22 Jan 2021 18:19:43 +0000 (19:19 +0100)]
block, bfq: use half slice_idle as a threshold to check short ttime

The value of the I/O plugging (idling) timeout is used also as the
think-time threshold to decide whether a process has a short think
time.  In this respect, a good value of this timeout for rotational
drives is un the order of several ms. Yet, this is often too long a
time interval to be effective as a think-time threshold. This commit
mitigates this problem (by a lot, according to tests), by halving the
threshold.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: use an xarray for disk->part_tbl
Christoph Hellwig [Sun, 24 Jan 2021 10:02:41 +0000 (11:02 +0100)]
block: use an xarray for disk->part_tbl

Now that no fast path lookups in the partition table are left, there is
no point in micro-optimizing the data structure for it.  Just use a bog
standard xarray.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove DISK_PITER_REVERSE
Christoph Hellwig [Sun, 24 Jan 2021 10:02:40 +0000 (11:02 +0100)]
block: remove DISK_PITER_REVERSE

There is good reason to iterate backwards when deleting all partitions in
del_gendisk, just like we don't in blk_drop_partitions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: add a disk_uevent helper
Christoph Hellwig [Sun, 24 Jan 2021 10:02:39 +0000 (11:02 +0100)]
block: add a disk_uevent helper

Add a helper to call kobject_uevent for the disk and all partitions, and
unexport the disk_part_iter_* helpers that are now only used in the core
block code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-mq: use ->bi_bdev for I/O accounting
Christoph Hellwig [Sun, 24 Jan 2021 10:02:38 +0000 (11:02 +0100)]
blk-mq: use ->bi_bdev for I/O accounting

Remove the reverse map from a sector to a partition for I/O accounting by
simply using ->bi_bdev.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: use ->bi_bdev for bio based I/O accounting
Christoph Hellwig [Sun, 24 Jan 2021 10:02:37 +0000 (11:02 +0100)]
block: use ->bi_bdev for bio based I/O accounting

Rework the I/O accounting for bio based drivers to use ->bi_bdev.  This
means all drivers can now simply use bio_start_io_acct to start
accounting, and it will take partitions into account automatically.  To
end I/O account either bio_end_io_acct can be used if the driver never
remaps I/O to a different device, or bio_end_io_acct_remapped if the
driver did remap the I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: do not reassig ->bi_bdev when partition remapping
Christoph Hellwig [Sun, 24 Jan 2021 10:02:36 +0000 (11:02 +0100)]
block: do not reassig ->bi_bdev when partition remapping

There is no good reason to reassign ->bi_bdev when remapping the
partition-relative block number to the device wide one, as all the
information required by the drivers comes from the gendisk anyway.

Keeping the original ->bi_bdev alive will allow to greatly simplify
the partition-away I/O accounting.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: simplify submit_bio_checks a bit
Christoph Hellwig [Sun, 24 Jan 2021 10:02:35 +0000 (11:02 +0100)]
block: simplify submit_bio_checks a bit

Merge a few checks for whole devices vs partitions to streamline the
sanity checks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: store a block_device pointer in struct bio
Christoph Hellwig [Sun, 24 Jan 2021 10:02:34 +0000 (11:02 +0100)]
block: store a block_device pointer in struct bio

Replace the gendisk pointer in struct bio with a pointer to the newly
improved struct block device.  From that the gendisk can be trivially
accessed with an extra indirection, but it also allows to directly
look up all information related to partition remapping.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agodcssblk: remove the end of device check in dcssblk_submit_bio
Christoph Hellwig [Sun, 24 Jan 2021 10:02:33 +0000 (11:02 +0100)]
dcssblk: remove the end of device check in dcssblk_submit_bio

The block layer already checks for this conditions in bio_check_eod
before calling the driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agobrd: remove the end of device check in brd_do_bvec
Christoph Hellwig [Sun, 24 Jan 2021 10:02:32 +0000 (11:02 +0100)]
brd: remove the end of device check in brd_do_bvec

The block layer already checks for this conditions in bio_check_eod
before calling the driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonvme: allow revalidate to set a namespace read-only
Christoph Hellwig [Sat, 9 Jan 2021 10:42:54 +0000 (11:42 +0100)]
nvme: allow revalidate to set a namespace read-only

Unconditionally call set_disk_ro now that it only updates the hardware
state.  This allows to properly set up the Linux devices read-only when
the controller turns a previously writable namespace read-only.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agorbd: remove the ->set_read_only method
Christoph Hellwig [Sat, 9 Jan 2021 10:42:53 +0000 (11:42 +0100)]
rbd: remove the ->set_read_only method

Now that the hardware read-only state can't be changed by the BLKROSET
ioctl, the code in this method is not required anymore.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: propagate BLKROSET on the whole device to all partitions
Christoph Hellwig [Sat, 9 Jan 2021 10:42:52 +0000 (11:42 +0100)]
block: propagate BLKROSET on the whole device to all partitions

Change the policy so that a BLKROSET on the whole device also affects
partitions.  To quote Martin K. Petersen:

It's very common for database folks to twiddle the read-only state of
block devices and partitions. I know that our users will find it very
counter-intuitive that setting /dev/sda read-only won't prevent writes
to /dev/sda1.

The existing behavior is inconsistent in the sense that doing:

  # blockdev --setro /dev/sda
  # echo foo > /dev/sda1

permits writes. But:

  # blockdev --setro /dev/sda
  <something triggers revalidate>
  # echo foo > /dev/sda1

doesn't.

And a subsequent:

  # blockdev --setrw /dev/sda
  # echo foo > /dev/sda1

doesn't work either since sda1's read-only policy has been inherited
from the whole-disk device.

You need to do:

  # blockdev --rereadpt

after setting the whole-disk device rw to effectuate the same change on
the partitions, otherwise they are stuck being read-only indefinitely.

However, setting the read-only policy on a partition does *not* require
the revalidate step. As a matter of fact, doing the revalidate will blow
away the policy setting you just made.

So the user needs to take different actions depending on whether they
are trying to read-protect a whole-disk device or a partition. Despite
using the same ioctl. That is really confusing.

I have lost count how many times our customers have had data clobbered
because of ambiguity of the existing whole-disk device policy. The
current behavior violates the principle of least surprise by letting the
user think they write protected the whole disk when they actually
didn't.

Suggested-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: add a hard-readonly flag to struct gendisk
Christoph Hellwig [Sat, 9 Jan 2021 10:42:51 +0000 (11:42 +0100)]
block: add a hard-readonly flag to struct gendisk

Commit 20bd1d026aac ("scsi: sd: Keep disk read-only when re-reading
partition") addressed a long-standing problem with user read-only
policy being overridden as a result of a device-initiated revalidate.
The commit has since been reverted due to a regression that left some
USB devices read-only indefinitely.

To fix the underlying problems with revalidate we need to keep track
of hardware state and user policy separately.

The gendisk has been updated to reflect the current hardware state set
by the device driver. This is done to allow returning the device to
the hardware state once the user clears the BLKROSET flag.

The resulting semantics are as follows:

 - If BLKROSET sets a given partition read-only, that partition will
   remain read-only even if the underlying storage stack initiates a
   revalidate. However, the BLKRRPART ioctl will cause the partition
   table to be dropped and any user policy on partitions will be lost.

 - If BLKROSET has not been set, both the whole disk device and any
   partitions will reflect the current write-protect state of the
   underlying device.

Based on a patch from Martin K. Petersen <martin.petersen@oracle.com>.

Reported-by: Oleksii Kurochko <olkuroch@cisco.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201221
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove the NULL bdev check in bdev_read_only
Christoph Hellwig [Sat, 9 Jan 2021 10:42:50 +0000 (11:42 +0100)]
block: remove the NULL bdev check in bdev_read_only

Only a single caller can end up in bdev_read_only, so move the check
there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agodm: use bdev_read_only to check if a device is read-only
Christoph Hellwig [Sat, 9 Jan 2021 10:42:49 +0000 (11:42 +0100)]
dm: use bdev_read_only to check if a device is read-only

dm-thin and dm-cache also work on partitions, so use the proper
interface to check if the device is read-only.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoLinux 5.11-rc5
Linus Torvalds [Mon, 25 Jan 2021 00:47:14 +0000 (16:47 -0800)]
Linux 5.11-rc5

3 years agoMerge tag 'sh-for-5.11' of git://git.libc.org/linux-sh
Linus Torvalds [Sun, 24 Jan 2021 21:52:02 +0000 (13:52 -0800)]
Merge tag 'sh-for-5.11' of git://git.libc.org/linux-sh

Pull arch/sh updates from Rich Felker:
 "Cleanup and warning fixes"

* tag 'sh-for-5.11' of git://git.libc.org/linux-sh:
  sh/intc: Restore devm_ioremap() alignment
  sh: mach-sh03: remove duplicate include
  arch: sh: remove duplicate include
  sh: Drop ARCH_NR_GPIOS definition
  sh: Remove unused HAVE_COPY_THREAD_TLS macro
  sh: remove CONFIG_IDE from most defconfig
  sh: mm: Convert to DEFINE_SHOW_ATTRIBUTE
  sh: intc: Convert to DEFINE_SHOW_ATTRIBUTE
  arch/sh: hyphenate Non-Uniform in Kconfig prompt
  sh: dma: fix kconfig dependency for G2_DMA

3 years agoMerge tag 'io_uring-5.11-2021-01-24' of git://git.kernel.dk/linux-block
Linus Torvalds [Sun, 24 Jan 2021 20:30:14 +0000 (12:30 -0800)]
Merge tag 'io_uring-5.11-2021-01-24' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Still need a final cancelation fix that isn't quite done done,
  expected in the next day or two. That said, this contains:

   - Wakeup fix for IOPOLL requests

   - SQPOLL split close op handling fix

   - Ensure that any use of io_uring fd itself is marked as inflight

   - Short non-regular file read fix (Pavel)

   - Fix up bad false positive warning (Pavel)

   - SQPOLL fixes (Pavel)

   - In-flight removal fix (Pavel)"

* tag 'io_uring-5.11-2021-01-24' of git://git.kernel.dk/linux-block:
  io_uring: account io_uring internal files as REQ_F_INFLIGHT
  io_uring: fix sleeping under spin in __io_clean_op
  io_uring: fix short read retries for non-reg files
  io_uring: fix SQPOLL IORING_OP_CLOSE cancelation state
  io_uring: fix skipping disabling sqo on exec
  io_uring: fix uring_flush in exit_files() warning
  io_uring: fix false positive sqo warning on flush
  io_uring: iopoll requests should also wake task ->in_idle state

3 years agoMerge tag 'block-5.11-2021-01-24' of git://git.kernel.dk/linux-block
Linus Torvalds [Sun, 24 Jan 2021 20:24:35 +0000 (12:24 -0800)]
Merge tag 'block-5.11-2021-01-24' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request from Christoph:
      - fix a status code in nvmet (Chaitanya Kulkarni)
      - avoid double completions in nvme-rdma/nvme-tcp (Chao Leng)
      - fix the CMB support to cope with NVMe 1.4 controllers (Klaus Jensen)
      - fix PRINFO handling in the passthrough ioctl (Revanth Rajashekar)
      - fix a double DMA unmap in nvme-pci

 - lightnvm error path leak fix (Pan)

 - MD pull request from Song:
      - Flush request fix (Xiao)

* tag 'block-5.11-2021-01-24' of git://git.kernel.dk/linux-block:
  lightnvm: fix memory leak when submit fails
  nvme-pci: fix error unwind in nvme_map_data
  nvme-pci: refactor nvme_unmap_data
  md: Set prev_flush_start and flush_bio in an atomic way
  nvmet: set right status on error in id-ns handler
  nvme-pci: allow use of cmb on v1.4 controllers
  nvme-tcp: avoid request double completion for concurrent nvme_tcp_timeout
  nvme-rdma: avoid request double completion for concurrent nvme_rdma_timeout
  nvme: check the PRINFO bit before deciding the host buffer length

3 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Sun, 24 Jan 2021 20:16:34 +0000 (12:16 -0800)]
Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
 "18 patches.

  Subsystems affected by this patch series: mm (pagealloc, memcg, kasan,
  memory-failure, and highmem), ubsan, proc, and MAINTAINERS"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  MAINTAINERS: add a couple more files to the Clang/LLVM section
  proc_sysctl: fix oops caused by incorrect command parameters
  powerpc/mm/highmem: use __set_pte_at() for kmap_local()
  mips/mm/highmem: use set_pte() for kmap_local()
  mm/highmem: prepare for overriding set_pte_at()
  sparc/mm/highmem: flush cache and TLB
  mm: fix page reference leak in soft_offline_page()
  ubsan: disable unsigned-overflow check for i386
  kasan, mm: fix resetting page_alloc tags for HW_TAGS
  kasan, mm: fix conflicts with init_on_alloc/free
  kasan: fix HW_TAGS boot parameters
  kasan: fix incorrect arguments passing in kasan_add_zero_shadow
  kasan: fix unaligned address is unhandled in kasan_remove_zero_shadow
  mm: fix numa stats for thp migration
  mm: memcg: fix memcg file_dirty numa stat
  mm: memcg/slab: optimize objcg stock draining
  mm: fix initialization of struct page for holes in memory layout
  x86/setup: don't remove E820_TYPE_RAM for pfn 0

3 years agoMerge tag 'char-misc-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregk...
Linus Torvalds [Sun, 24 Jan 2021 19:26:46 +0000 (11:26 -0800)]
Merge tag 'char-misc-5.11-rc5' of git://git./linux/kernel/git/gregkh/char-misc

Pull char/misc driver fixes from Greg KH:
 "Here are some small char/misc driver fixes for 5.11-rc5:

   - habanalabs driver fixes

   - phy driver fixes

   - hwtracing driver fixes

   - rtsx cardreader driver fix

  All of these have been in linux-next with no reported issues"

* tag 'char-misc-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  misc: rtsx: init value of aspm_enabled
  habanalabs: disable FW events on device removal
  habanalabs: fix backward compatibility of idle check
  habanalabs: zero pci counters packet before submit to FW
  intel_th: pci: Add Alder Lake-P support
  stm class: Fix module init return on allocation failure
  habanalabs: prevent soft lockup during unmap
  habanalabs: fix reset process in case of failures
  habanalabs: fix dma_addr passed to dma_mmap_coherent
  phy: mediatek: allow compile-testing the dsi phy
  phy: cpcap-usb: Fix warning for missing regulator_disable
  PHY: Ingenic: fix unconditional build of phy-ingenic-usb

3 years agoMerge tag 'driver-core-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 24 Jan 2021 19:05:48 +0000 (11:05 -0800)]
Merge tag 'driver-core-5.11-rc5' of git://git./linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
 "Here are some small driver core fixes for 5.11-rc5 that resolve some
  reported problems:

   - revert of a -rc1 patch that was causing problems with some machines

   - device link device name collision problem fix (busses only have to
     name devices unique to their bus, not unique to all busses)

   - kernfs splice bugfixes to resolve firmware loading problems for
     Qualcomm systems.

   - other tiny driver core fixes for minor issues reported.

  All of these have been in linux-next with no reported problems"

* tag 'driver-core-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  driver core: Fix device link device name collision
  driver core: Extend device_is_dependent()
  kernfs: wire up ->splice_read and ->splice_write
  kernfs: implement ->write_iter
  kernfs: implement ->read_iter
  Revert "driver core: Reorder devices on successful probe"
  Driver core: platform: Add extra error check in devm_platform_get_irqs_affinity()
  drivers core: Free dma_range_map when driver probe failed

3 years agoMerge tag 'staging-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh...
Linus Torvalds [Sun, 24 Jan 2021 19:02:01 +0000 (11:02 -0800)]
Merge tag 'staging-5.11-rc5' of git://git./linux/kernel/git/gregkh/staging

Pull staging/IIO driver fixes from Greg KH:
 "Here are some IIO driver fixes for 5.11-rc5 to resolve some reported
  problems.

  Nothing major, just a few small fixes, all of these have been in
  linux-next for a while and full details are in the shortlog"

* tag 'staging-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  iio: sx9310: Fix semtech,avg-pos-strength setting when > 16
  iio: common: st_sensors: fix possible infinite loop in st_sensors_irq_thread
  iio: ad5504: Fix setting power-down state
  counter:ti-eqep: remove floor
  drivers: iio: temperature: Add delay after the addressed reset command in mlx90632.c
  iio: adc: ti_am335x_adc: remove omitted iio_kfifo_free()
  dt-bindings: iio: accel: bma255: Fix bmc150/bmi055 compatible
  iio: sx9310: Off by one in sx9310_read_thresh()

3 years agoMerge tag 'tty-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Linus Torvalds [Sun, 24 Jan 2021 18:56:45 +0000 (10:56 -0800)]
Merge tag 'tty-5.11-rc5' of git://git./linux/kernel/git/gregkh/tty

Pull tty/serial fixes from Greg KH:
 "Here are three small tty/serial fixes for 5.11-rc5 to resolve reported
  problems:

   - two patches to fix up writing to ttys with splice

   - mvebu-uart driver fix for reported problem

  All of these have been in linux-next with no reported problems"

* tag 'tty-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  tty: fix up hung_up_tty_write() conversion
  tty: implement write_iter
  serial: mvebu-uart: fix tx lost characters at power off

3 years agoMerge tag 'usb-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Linus Torvalds [Sun, 24 Jan 2021 18:54:54 +0000 (10:54 -0800)]
Merge tag 'usb-5.11-rc5' of git://git./linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
 "Here are some small USB driver fixes for 5.11-rc5.  They resolve:

   - xhci issues for some reported problems

   - ehci driver issue for one specific device

   - USB gadget fixes for some reported problems

   - cdns3 driver fixes for issues reported

   - MAINTAINERS file update

   - thunderbolt minor fix

  All of these have been in linux-next with no reported issues"

* tag 'usb-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  usb: bdc: Make bdc pci driver depend on BROKEN
  xhci: tegra: Delay for disabling LFPS detector
  xhci: make sure TRB is fully written before giving it to the controller
  usb: udc: core: Use lock when write to soft_connect
  USB: gadget: dummy-hcd: Fix errors in port-reset handling
  usb: gadget: aspeed: fix stop dma register setting.
  USB: ehci: fix an interrupt calltrace error
  ehci: fix EHCI host controller initialization sequence
  MAINTAINERS: update Peter Chen's email address
  thunderbolt: Drop duplicated 0x prefix from format string
  MAINTAINERS: Update address for Cadence USB3 driver
  usb: cdns3: imx: improve driver .remove API
  usb: cdns3: imx: fix can't create core device the second time issue
  usb: cdns3: imx: fix writing read-only memory issue

3 years agoMAINTAINERS: add a couple more files to the Clang/LLVM section
Nathan Chancellor [Sun, 24 Jan 2021 05:02:21 +0000 (21:02 -0800)]
MAINTAINERS: add a couple more files to the Clang/LLVM section

The K: entry should ensure that Nick and I always get CC'd on patches that
touch these files but it is better to be explicit rather than implicit.

Link: https://lkml.kernel.org/r/20210114004059.2129921-1-natechancellor@gmail.com
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoproc_sysctl: fix oops caused by incorrect command parameters
Xiaoming Ni [Sun, 24 Jan 2021 05:02:16 +0000 (21:02 -0800)]
proc_sysctl: fix oops caused by incorrect command parameters

The process_sysctl_arg() does not check whether val is empty before
invoking strlen(val).  If the command line parameter () is incorrectly
configured and val is empty, oops is triggered.

For example:
  "hung_task_panic=1" is incorrectly written as "hung_task_panic", oops is
  triggered. The call stack is as follows:
    Kernel command line: .... hung_task_panic
    ......
    Call trace:
    __pi_strlen+0x10/0x98
    parse_args+0x278/0x344
    do_sysctl_args+0x8c/0xfc
    kernel_init+0x5c/0xf4
    ret_from_fork+0x10/0x30

To fix it, check whether "val" is empty when "phram" is a sysctl field.
Error codes are returned in the failure branch, and error logs are
generated by parse_args().

Link: https://lkml.kernel.org/r/20210118133029.28580-1-nixiaoming@huawei.com
Fixes: 3db978d480e2843 ("kernel/sysctl: support setting sysctl parameters from kernel command line")
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: <stable@vger.kernel.org> [5.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agopowerpc/mm/highmem: use __set_pte_at() for kmap_local()
Thomas Gleixner [Sun, 24 Jan 2021 05:02:11 +0000 (21:02 -0800)]
powerpc/mm/highmem: use __set_pte_at() for kmap_local()

The original PowerPC highmem mapping function used __set_pte_at() to
denote that the mapping is per CPU.  This got lost with the conversion
to the generic implementation.

Override the default map function.

Link: https://lkml.kernel.org/r/20210112170411.281464308@linutronix.de
Fixes: 47da42b27a56 ("powerpc/mm/highmem: Switch to generic kmap atomic")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Cercueil <paul@crapouillou.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomips/mm/highmem: use set_pte() for kmap_local()
Thomas Gleixner [Sun, 24 Jan 2021 05:02:07 +0000 (21:02 -0800)]
mips/mm/highmem: use set_pte() for kmap_local()

set_pte_at() on MIPS invokes update_cache() which might recurse into
kmap_local().

Use set_pte() like the original MIPS highmem implementation did.

Link: https://lkml.kernel.org/r/20210112170411.187513575@linutronix.de
Fixes: a4c33e83bca1 ("mips/mm/highmem: Switch to generic kmap atomic")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Paul Cercueil <paul@crapouillou.net>
Reported-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm/highmem: prepare for overriding set_pte_at()
Thomas Gleixner [Sun, 24 Jan 2021 05:02:02 +0000 (21:02 -0800)]
mm/highmem: prepare for overriding set_pte_at()

The generic kmap_local() map function uses set_pte_at(), but MIPS requires
set_pte() and PowerPC wants __set_pte_at().

Provide arch_kmap_local_set_pte() and default it to set_pte_at().

Link: https://lkml.kernel.org/r/20210112170411.056306194@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Cercueil <paul@crapouillou.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agosparc/mm/highmem: flush cache and TLB
Thomas Gleixner [Sun, 24 Jan 2021 05:01:57 +0000 (21:01 -0800)]
sparc/mm/highmem: flush cache and TLB

Patch series "mm/highmem: Fix fallout from generic kmap_local
conversions".

The kmap_local conversion wreckaged sparc, mips and powerpc as it missed
some of the details in the original implementation.

This patch (of 4):

The recent conversion to the generic kmap_local infrastructure failed to
assign the proper pre/post map/unmap flush operations for sparc.

Sparc requires cache flush before map/unmap and tlb flush afterwards.

Link: https://lkml.kernel.org/r/20210112170136.078559026@linutronix.de
Link: https://lkml.kernel.org/r/20210112170410.905976187@linutronix.de
Fixes: 3293efa97807 ("sparc/mm/highmem: Switch to generic kmap atomic")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Andreas Larsson <andreas@gaisler.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Cercueil <paul@crapouillou.net>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agomm: fix page reference leak in soft_offline_page()
Dan Williams [Sun, 24 Jan 2021 05:01:52 +0000 (21:01 -0800)]
mm: fix page reference leak in soft_offline_page()

The conversion to move pfn_to_online_page() internal to
soft_offline_page() missed that the get_user_pages() reference taken by
the madvise() path needs to be dropped when pfn_to_online_page() fails.

Note the direct sysfs-path to soft_offline_page() does not perform a
get_user_pages() lookup.

When soft_offline_page() is handed a pfn_valid() && !pfn_to_online_page()
pfn the kernel hangs at dax-device shutdown due to a leaked reference.

Link: https://lkml.kernel.org/r/161058501210.1840162.8108917599181157327.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: feec24a6139d ("mm, soft-offline: convert parameter to pfn")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoubsan: disable unsigned-overflow check for i386
Arnd Bergmann [Sun, 24 Jan 2021 05:01:48 +0000 (21:01 -0800)]
ubsan: disable unsigned-overflow check for i386

Building ubsan kernels even for compile-testing introduced these
warnings in my randconfig environment:

  crypto/blake2b_generic.c:98:13: error: stack frame size of 9636 bytes in function 'blake2b_compress' [-Werror,-Wframe-larger-than=]
  static void blake2b_compress(struct blake2b_state *S,

  crypto/sha512_generic.c:151:13: error: stack frame size of 1292 bytes in function 'sha512_generic_block_fn' [-Werror,-Wframe-larger-than=]
  static void sha512_generic_block_fn(struct sha512_state *sst, u8 const *src,

  lib/crypto/curve25519-fiat32.c:312:22: error: stack frame size of 2180 bytes in function 'fe_mul_impl' [-Werror,-Wframe-larger-than=]
  static noinline void fe_mul_impl(u32 out[10], const u32 in1[10], const u32 in2[10])

  lib/crypto/curve25519-fiat32.c:444:22: error: stack frame size of 1588 bytes in function 'fe_sqr_impl' [-Werror,-Wframe-larger-than=]
  static noinline void fe_sqr_impl(u32 out[10], const u32 in1[10])

Further testing showed that this is caused by
-fsanitize=unsigned-integer-overflow, but is isolated to the 32-bit x86
architecture.

The one in blake2b immediately overflows the 8KB stack area
architectures, so better ensure this never happens by disabling the
option for 32-bit x86.

Link: https://lkml.kernel.org/r/20210112202922.2454435-1-arnd@kernel.org
Link: https://lore.kernel.org/lkml/20201230154749.746641-1-arnd@kernel.org/
Fixes: d0a3ac549f38 ("ubsan: enable for all*config builds")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Marco Elver <elver@google.com>
Cc: George Popescu <georgepope@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agokasan, mm: fix resetting page_alloc tags for HW_TAGS
Andrey Konovalov [Sun, 24 Jan 2021 05:01:43 +0000 (21:01 -0800)]
kasan, mm: fix resetting page_alloc tags for HW_TAGS

A previous commit added resetting KASAN page tags to
kernel_init_free_pages() to avoid false-positives due to accesses to
metadata with the hardware tag-based mode.

That commit did reset page tags before the metadata access, but didn't
restore them after.  As the result, KASAN fails to detect bad accesses
to page_alloc allocations on some configurations.

Fix this by recovering the tag after the metadata access.

Link: https://lkml.kernel.org/r/02b5bcd692e912c27d484030f666b350ad7e4ae4.1611074450.git.andreyknvl@google.com
Fixes: aa1ef4d7b3f6 ("kasan, mm: reset tags when accessing metadata")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agokasan, mm: fix conflicts with init_on_alloc/free
Andrey Konovalov [Sun, 24 Jan 2021 05:01:38 +0000 (21:01 -0800)]
kasan, mm: fix conflicts with init_on_alloc/free

A few places where SLUB accesses object's data or metadata were missed
in a previous patch.  This leads to false positives with hardware
tag-based KASAN when bulk allocations are used with init_on_alloc/free.

Fix the false-positives by resetting pointer tags during these accesses.

(The kasan_reset_tag call is removed from slab_alloc_node, as it's added
 into maybe_wipe_obj_freeptr.)

Link: https://linux-review.googlesource.com/id/I50dd32838a666e173fe06c3c5c766f2c36aae901
Link: https://lkml.kernel.org/r/093428b5d2ca8b507f4a79f92f9929b35f7fada7.1610731872.git.andreyknvl@google.com
Fixes: aa1ef4d7b3f67 ("kasan, mm: reset tags when accessing metadata")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agokasan: fix HW_TAGS boot parameters
Andrey Konovalov [Sun, 24 Jan 2021 05:01:34 +0000 (21:01 -0800)]
kasan: fix HW_TAGS boot parameters

The initially proposed KASAN command line parameters are redundant.

This change drops the complex "kasan.mode=off/prod/full" parameter and
adds a simpler kill switch "kasan=off/on" instead.  The new parameter
together with the already existing ones provides a cleaner way to
express the same set of features.

The full set of parameters with this change:

  kasan=off/on             - whether KASAN is enabled
  kasan.fault=report/panic - whether to only print a report or also panic
  kasan.stacktrace=off/on  - whether to collect alloc/free stack traces

Default values:

  kasan=on
  kasan.fault=report
  kasan.stacktrace=on  (if CONFIG_DEBUG_KERNEL=y)
  kasan.stacktrace=off (otherwise)

Link: https://linux-review.googlesource.com/id/Ib3694ed90b1e8ccac6cf77dfd301847af4aba7b8
Link: https://lkml.kernel.org/r/4e9c4a4bdcadc168317deb2419144582a9be6e61.1610736745.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agokasan: fix incorrect arguments passing in kasan_add_zero_shadow
Lecopzer Chen [Sun, 24 Jan 2021 05:01:29 +0000 (21:01 -0800)]
kasan: fix incorrect arguments passing in kasan_add_zero_shadow

kasan_remove_zero_shadow() shall use original virtual address, start and
size, instead of shadow address.

Link: https://lkml.kernel.org/r/20210103063847.5963-1-lecopzer@gmail.com
Fixes: 0207df4fa1a86 ("kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN")
Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agokasan: fix unaligned address is unhandled in kasan_remove_zero_shadow
Lecopzer Chen [Sun, 24 Jan 2021 05:01:25 +0000 (21:01 -0800)]
kasan: fix unaligned address is unhandled in kasan_remove_zero_shadow

During testing kasan_populate_early_shadow and kasan_remove_zero_shadow,
if the shadow start and end address in kasan_remove_zero_shadow() is not
aligned to PMD_SIZE, the remain unaligned PTE won't be removed.

In the test case for kasan_remove_zero_shadow():

    shadow_start: 0xffffffb802000000, shadow end: 0xffffffbfbe000000

    3-level page table:
      PUD_SIZE: 0x40000000 PMD_SIZE: 0x200000 PAGE_SIZE: 4K

0xffffffbf80000000 ~ 0xffffffbfbdf80000 will not be removed because in
kasan_remove_pud_table(), kasan_pmd_table(*pud) is true but the next
address is 0xffffffbfbdf80000 which is not aligned to PUD_SIZE.

In the correct condition, this should fallback to the next level
kasan_remove_pmd_table() but the condition flow always continue to skip
the unaligned part.

Fix by correcting the condition when next and addr are neither aligned.

Link: https://lkml.kernel.org/r/20210103135621.83129-1-lecopzer@gmail.com
Fixes: 0207df4fa1a86 ("kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN")
Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: YJ Chiang <yj.chiang@mediatek.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years agoMerge tag 'irq_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 24 Jan 2021 18:24:20 +0000 (10:24 -0800)]
Merge tag 'irq_urgent_for_v5.11_rc5' of git://git./linux/kernel/git/tip/tip

Pull irq fixes from Borislav Petkov:

 - Fix a kernel panic in mips-cpu due to invalid irq domain hierarchy.

 - Fix to not lose IPIs on bcm2836.

 - Fix for a bogus marking of ITS devices as shared due to unitialized
   stack variable.

 - Clear a phantom interrupt on qcom-pdc to unblock suspend.

 - Small cleanups, warning and build fixes.

* tag 'irq_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Export irq_check_status_bit()
  irqchip/mips-cpu: Set IPI domain parent chip
  irqchip/pruss: Simplify the TI_PRUSS_INTC Kconfig
  irqchip/loongson-liointc: Fix build warnings
  driver core: platform: Add extra error check in devm_platform_get_irqs_affinity()
  irqchip/bcm2836: Fix IPI acknowledgement after conversion to handle_percpu_devid_irq
  irqchip/irq-sl28cpld: Convert comma to semicolon
  genirq/msi: Initialize msi_alloc_info before calling msi_domain_prepare_irqs()

3 years agoMerge tag 'objtool_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 24 Jan 2021 18:17:03 +0000 (10:17 -0800)]
Merge tag 'objtool_urgent_for_v5.11_rc5' of git://git./linux/kernel/git/tip/tip

Pull objtool fixes from Borislav Petkov:

 - Adjust objtool to handle a recent binutils change to not generate
   unused symbols anymore.

 - Revert the fail-the-build-on-fatal-errors objtool strategy for now
   due to the ever-increasing matrix of supported toolchains/plugins and
   them causing too many such fatal errors currently.

 - Do not add empty symbols to objdump's rbtree to accommodate clang
   removing section symbols.

* tag 'objtool_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Don't fail on missing symbol table
  objtool: Don't fail the kernel build on fatal errors
  objtool: Don't add empty symbols to the rbtree

3 years agoMerge tag 'sched_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 24 Jan 2021 18:09:20 +0000 (10:09 -0800)]
Merge tag 'sched_urgent_for_v5.11_rc5' of git://git./linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

 - Correct the marking of kthreads which are supposed to run on a
   specific, single CPU vs such which are affine to only one CPU, mark
   per-cpu workqueue threads as such and make sure that marking
   "survives" CPU hotplug. Fix CPU hotplug issues with such kthreads.

 - A fix to not push away tasks on CPUs coming online.

 - Have workqueue CPU hotplug code use cpu_possible_mask when breaking
   affinity on CPU offlining so that pending workers can finish on newly
   arrived onlined CPUs too.

 - Dump tasks which haven't vacated a CPU which is currently being
   unplugged.

 - Register a special scale invariance callback which gets called on
   resume from RAM to read out APERF/MPERF after resume and thus make
   the schedutil scaling governor more precise.

* tag 'sched_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Relax the set_cpus_allowed_ptr() semantics
  sched: Fix CPU hotplug / tighten is_per_cpu_kthread()
  sched: Prepare to use balance_push in ttwu()
  workqueue: Restrict affinity change to rescuer
  workqueue: Tag bound workers with KTHREAD_IS_PER_CPU
  kthread: Extract KTHREAD_IS_PER_CPU
  sched: Don't run cpu-online with balance_push() enabled
  workqueue: Use cpu_possible_mask instead of cpu_active_mask to break affinity
  sched/core: Print out straggler tasks in sched_cpu_dying()
  x86: PM: Register syscore_ops for scale invariance

3 years agoMerge tag 'timers_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 24 Jan 2021 17:58:38 +0000 (09:58 -0800)]
Merge tag 'timers_urgent_for_v5.11_rc5' of git://git./linux/kernel/git/tip/tip

Pull timer fixes from Borislav Petkov:

 - Fix an integer overflow in the NTP RTC synchronization which led to
   the latter happening every 2 seconds instead of the intended every 11
   minutes.

 - Get rid of now unused get_seconds().

* tag 'timers_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  ntp: Fix RTC synchronization on 32-bit platforms
  timekeeping: Remove unused get_seconds()

3 years agoMerge tag 'x86_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 24 Jan 2021 17:46:05 +0000 (09:46 -0800)]
Merge tag 'x86_urgent_for_v5.11_rc5' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Add a new Intel model number for Alder Lake

 - Differentiate which aspects of the FPU state get saved/restored when
   the FPU is used in-kernel and fix a boot crash on K7 due to early
   MXCSR access before CR4.OSFXSR is even set.

 - A couple of noinstr annotation fixes

 - Correct die ID setting on AMD for users of topology information which
   need the correct die ID

 - A SEV-ES fix to handle string port IO to/from kernel memory properly

* tag 'x86_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu: Add another Alder Lake CPU to the Intel family
  x86/mmx: Use KFPU_387 for MMX string operations
  x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
  x86/topology: Make __max_die_per_package available unconditionally
  x86: __always_inline __{rd,wr}msr()
  x86/mce: Remove explicit/superfluous tracing
  locking/lockdep: Avoid noinstr warning for DEBUG_LOCKDEP
  locking/lockdep: Cure noinstr fail
  x86/sev: Fix nonistr violation
  x86/entry: Fix noinstr fail
  x86/cpu/amd: Set __max_die_per_package on AMD
  x86/sev-es: Handle string port IO to kernel memory properly

3 years agoMerge tag 'powerpc-5.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc...
Linus Torvalds [Sun, 24 Jan 2021 17:40:51 +0000 (09:40 -0800)]
Merge tag 'powerpc-5.11-5' of git://git./linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

 - Fix a bad interaction between the scv handling and the fallback L1D
   flush, which could lead to user register corruption. Only affects
   people using scv (~no one) on machines with old firmware that are
   missing the L1D flush.

 - Two small selftest fixes.

Thanks to Eirik Fuller, Libor Pechacek, Nicholas Piggin, Sandipan Das,
and Tulio Magno Quites Machado Filho.

* tag 'powerpc-5.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/64s: fix scv entry fallback flush vs interrupt
  selftests/powerpc: Only test lwm/stmw on big endian
  selftests/powerpc: Fix exit status of pkey tests

3 years agoMerge tag 'for-linus-2021-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 24 Jan 2021 17:35:28 +0000 (09:35 -0800)]
Merge tag 'for-linus-2021-01-24' of git://git./linux/kernel/git/brauner/linux

Pull misc fixes from Christian Brauner:

 - Jann reported sparse complaints because of a missing __user
   annotation in a helper we added way back when we added
   pidfd_send_signal() to avoid compat syscall handling. Fix it.

 - Yanfei replaces a reference in a comment to the _do_fork() helper I
   removed a while ago with a reference to the new kernel_clone()
   replacement

 - Alexander Guril added a simple coding style fix

* tag 'for-linus-2021-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  kthread: remove comments about old _do_fork() helper
  Kernel: fork.c: Fix coding style: Do not use {} around single-line statements
  signal: Add missing __user annotation to copy_siginfo_from_user_any