platform/kernel/linux-rpi.git
3 years agonull_blk: discard zones on reset
Damien Le Moal [Fri, 20 Nov 2020 01:55:17 +0000 (10:55 +0900)]
null_blk: discard zones on reset

When memory backing is enabled, use null_handle_discard() to free the
backing memory used by a zone when the zone is being reset.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonull_blk: cleanup discard handling
Damien Le Moal [Fri, 20 Nov 2020 01:55:16 +0000 (10:55 +0900)]
null_blk: cleanup discard handling

null_handle_discard() is called from both null_handle_rq() and
null_handle_bio(). As these functions are only passed a nullb_cmd
structure, this forces pointer dereferences to identiify the discard
operation code and to access the sector range to be discarded.
Simplify all this by changing the interface of the functions
null_handle_discard() and null_handle_memory_backed() to pass along
the operation code, operation start sector and number of sectors. With
this change null_handle_discard() can be called directly from
null_handle_memory_backed().

Also add a message warning that the discard configuration attribute has
no effect when memory backing is disabled.

No functional change is introduced by this patch.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonull_blk: Improve implicit zone close
Damien Le Moal [Fri, 20 Nov 2020 01:55:15 +0000 (10:55 +0900)]
null_blk: Improve implicit zone close

When open zone resource management is enabled, that is, when a null_blk
zoned device is created with zone_max_open different than 0, implicitly
or explicitly opening a zone may require implicitly closing a zone
that is already implicitly open. This operation is done using the
function null_close_first_imp_zone(), which search for an implicitly
open zone to close starting from the first sequential zone. This
implementation is simple but may result in the same being constantly
implicitly closed and then implicitly reopened on write, namely, the
lowest numbered zone that is being written.

Avoid this by starting the search for an implicitly open zone starting
from the zone following the last zone that was implicitly closed. The
function null_close_first_imp_zone() is renamed
null_close_imp_open_zone().

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonull_blk: improve zone locking
Damien Le Moal [Fri, 20 Nov 2020 01:55:14 +0000 (10:55 +0900)]
null_blk: improve zone locking

With memory backing disabled, using a single spinlock for protecting
zone information and zone resource management prevents the parallel
execution on multiple queue of IO requests to different zones.
Furthermore, regardless of the use of memory backing, if a null_blk
device is created without limits on the number of open and active zones,
accounting for zone resource management is not necessary.

>From these observations, zone locking is changed as follows to improve
performance:
1) the zone_lock spinlock is renamed zone_res_lock and used only if
   zone resource management is necessary, that is, if either
   zone_max_open or zone_max_active are not 0. This is indicated using
   the new boolean need_zone_res_mgmt in the nullb_device structure.
   null_zone_write() is modified to reduce the amount of code executed
   with the zone_res_lock spinlock held.
2) With memory backing disabled, per zone locking is changed to a
   spinlock per zone.
3) Introduce the structure nullb_zone to replace the use of
   struct blk_zone for zone information. This new structure includes a
   union of a spinlock and a mutex for zone locking. The spinlock is
   used when memory backing is disabled and the mutex is used with
   memory backing.

With these changes, fio performance with zonemode=zbd for 4K random
read and random write on a dual socket (24 cores per socket) machine
using the none schedulder is as follows:

before patch:
write (psync x 96 jobs) = 465 KIOPS
read (libaio@qd=8 x 96 jobs) = 1361 KIOPS
after patch:
write (psync x 96 jobs) = 456 KIOPS
read (libaio@qd=8 x 96 jobs) = 4096 KIOPS

Write performance remains mostly unchanged but read performance is three
times higher. Performance when using the mq-deadline scheduler is not
changed by this patch as mq-deadline becomes the bottleneck for a
multi-queue device.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: Align max_hw_sectors to logical blocksize
Damien Le Moal [Fri, 20 Nov 2020 01:55:13 +0000 (10:55 +0900)]
block: Align max_hw_sectors to logical blocksize

Block device drivers do not have to call blk_queue_max_hw_sectors() to
set a limit on request size if the default limit BLK_SAFE_MAX_SECTORS
is acceptable. However, this limit (255 sectors) may not be aligned
to the device logical block size which cannot be used as is for a
request maximum size. This is the case for the null_blk device driver.

Modify blk_queue_max_hw_sectors() to make sure that the request size
limits specified by the max_hw_sectors and max_sectors queue limits
are always aligned to the device logical block size. Additionally, to
avoid introducing a dependence on the execution order of this function
with blk_queue_logical_block_size(), also modify
blk_queue_logical_block_size() to perform the same alignment when the
logical block size is set after max_hw_sectors.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonull_blk: Fail zone append to conventional zones
Damien Le Moal [Fri, 20 Nov 2020 01:55:12 +0000 (10:55 +0900)]
null_blk: Fail zone append to conventional zones

Conventional zones do not have a write pointer and so cannot accept zone
append writes. Make sure to fail any zone append write command issued to
a conventional zone.

Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Fixes: e0489ed5daeb ("null_blk: Support REQ_OP_ZONE_APPEND")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonull_blk: Fix zone size initialization
Damien Le Moal [Fri, 20 Nov 2020 01:55:11 +0000 (10:55 +0900)]
null_blk: Fix zone size initialization

For a null_blk device with zoned mode enabled is currently initialized
with a number of zones equal to the device capacity divided by the zone
size, without considering if the device capacity is a multiple of the
zone size. If the zone size is not a divisor of the capacity, the zones
end up not covering the entire capacity, potentially resulting is out
of bounds accesses to the zone array.

Fix this by adding one last smaller zone with a size equal to the
remainder of the disk capacity divided by the zone size if the capacity
is not a multiple of the zone size. For such smaller last zone, the zone
capacity is also checked so that it does not exceed the smaller zone
size.

Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Fixes: ca4b2a011948 ("null_blk: add zone support")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agobcache: fix race between setting bdev state to none and new write request direct...
Dongsheng Yang [Mon, 7 Dec 2020 16:39:15 +0000 (00:39 +0800)]
bcache: fix race between setting bdev state to none and new write request direct to backing

There is a race condition in detaching as below:
A. detaching B. Write request
(1) writing back
(2) write back done, set bdev
    state to clean.
(3) cached_dev_put() and
    schedule_work(&dc->detach);
(4) write data [0 - 4K] directly
    into backing and ack to user.
(5) power-failure...

When we restart this bcache device, this bdev is clean but not detached,
and read [0 - 4K], we will get unexpected old data from cache device.

To fix this problem, set the bdev state to none when we writeback done
in detaching, and then if power-failure happened as above, the data in
cache will not be used in next bcache device starting, it's detached, we
will read the correct data from backing derectly.

Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock/rnbd: fix a null pointer dereference on dev->blk_symlink_name
Colin Ian King [Mon, 7 Dec 2020 14:54:46 +0000 (14:54 +0000)]
block/rnbd: fix a null pointer dereference on dev->blk_symlink_name

Currently in the case where dev->blk_symlink_name fails to be allocates
the error return path attempts to set an end-of-string character to
the unallocated dev->blk_symlink_name causing a null pointer dereference
error. Fix this by returning with an explicity ENOMEM error (which also
is missing in the original code as was not initialized).

Fixes: 1eb54f8f5dd8 ("block/rnbd: client: sysfs interface functions")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Addresses-Coverity: ("Dereference after null check")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name
Md Haris Iqbal [Thu, 26 Nov 2020 10:47:23 +0000 (11:47 +0100)]
block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name

For every rnbd_clt_dev, we alloc the pathname and blk_symlink_name
statically to NAME_MAX which is 255 bytes. In most of the cases we only
need less than 10 bytes, so 500 bytes per block device are wasted.

This commit dynamically allocates memory buffer for pathname and
blk_symlink_name.

Signed-off-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Lutz Pogrell <lutz.pogrell@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock/rnbd: call kobject_put in the failure path
Guoqing Jiang [Thu, 26 Nov 2020 10:47:22 +0000 (11:47 +0100)]
block/rnbd: call kobject_put in the failure path

Per the comment of kobject_init_and_add, we need to cleanup the memory
by call kobject_put.

Also we need to call kobject_del for the other failure cases if the
kobject_init_and_add doesn't fail.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoDocumentation/ABI/rnbd-srv: add document for force_close
Jack Wang [Thu, 26 Nov 2020 10:47:21 +0000 (11:47 +0100)]
Documentation/ABI/rnbd-srv: add document for force_close

describe force_close of device

Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock/rnbd-srv: close a mapped device from server side.
Lutz Pogrell [Thu, 26 Nov 2020 10:47:20 +0000 (11:47 +0100)]
block/rnbd-srv: close a mapped device from server side.

The forceful close of an exported device is required
for the use case, when the client side hangs, is crashed,
or is not accessible.

There have been cases observed, where only some of
the devices are to be cleaned up, but the session shall
remain.

When the device is to be exported to a different
client host, server side cleanup is required.

Signed-off-by: Lutz Pogrell <lutz.pogrell@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Gioh Kim <gi-oh.kim@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoDocumentation/ABI/rnbd-clt: session name is appended to the device path
Gioh Kim [Thu, 26 Nov 2020 10:47:19 +0000 (11:47 +0100)]
Documentation/ABI/rnbd-clt: session name is appended to the device path

When mapping a device,
/sys/devices/virtual/rnbd-client/ctl/devices/<device_id> was created.
But we found out that it had a problem when mapping the same file
on different servers. So we append the session name after the
device_id as below.
/sys/devices/virtual/rnbd-client/ctl/devices/<device_id>@<session_name>

Signed-off-by: Gioh Kim <gi-oh.kim@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoDocumentation/ABI/rnbd-clt: fix typo in sysfs-class-rnbd-client
Gioh Kim [Thu, 26 Nov 2020 10:47:18 +0000 (11:47 +0100)]
Documentation/ABI/rnbd-clt: fix typo in sysfs-class-rnbd-client

/sys/block/rnbd<N> is created, not /sys/block/rnbd_client/rnbd<N>

Signed-off-by: Gioh Kim <gi-oh.kim@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock/rnbd-clt: support mapping two devices with the same name from different servers
Guoqing Jiang [Thu, 26 Nov 2020 10:47:17 +0000 (11:47 +0100)]
block/rnbd-clt: support mapping two devices with the same name from different servers

Previously, we can't map same device name from different sessions
due to the limitation of sysfs naming mechanism.

root@clt2:~# ls -l /sys/class/rnbd-client/ctl/devices/
total 0
lrwxrwxrwx 1 root 0 Sep  2 16:31 !dev!nullb1 -> ../../../block/rnbd0

We only use the device name in above, which caused device with
the same name can't be mapped from another server. To address
the issue, the sessname is appended to the node to differentiate
where the device comes from.

Also, we need to check if the pathname is existed in a specific
session instead of search it in global sess_list.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Gioh Kim <gi-oh.kim@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock/rnbd-clt: Make path parameter optional for map_device
Md Haris Iqbal [Thu, 26 Nov 2020 10:47:16 +0000 (11:47 +0100)]
block/rnbd-clt: Make path parameter optional for map_device

During map_device if the given session exists, then the path parameter is
not used. In such a case, the path parameter is redundant.

This commit makes the path parameter optional for map_device. When the
path parameter is not given, if the session exists then that is used to
establish the rtrs connection.

If the session does not exist, and the path parameter is also missing,
then map_device fails.

Signed-off-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoMerge tag 'nvme-5.11-20201202' of git://git.infradead.org/nvme into for-5.11/drivers
Jens Axboe [Wed, 2 Dec 2020 15:56:55 +0000 (08:56 -0700)]
Merge tag 'nvme-5.11-20201202' of git://git.infradead.org/nvme into for-5.11/drivers

Pull NVMe updates from Christoph:

"nvme updates for 5.11

 - nvmet passthrough improvements (Chaitanya Kulkarni)
 - fcloop error injection support (James Smart)
 - read-only support for zoned namespaces without Zone Append
   (Javier González)
 - improve some error message (Minwoo Im)
 - reject I/O to offline fabrics namespaces (Victor Gladkov)
 - PCI queue allocation cleanups (Niklas Schnelle)
 - remove an unused allocation in nvmet (Amit Engel)
 - a Kconfig spelling fix (Colin Ian King)
 - nvme_req_qid simplication (Baolin Wang)"

* tag 'nvme-5.11-20201202' of git://git.infradead.org/nvme: (23 commits)
  nvme: export zoned namespaces without Zone Append support read-only
  nvme: rename bdev operations
  nvme: rename controller base dev_t char device
  nvme: remove unnecessary return values
  nvme: print a warning for when listing active namespaces fails
  nvme: improve an error message on Identify failure
  nvme-fabrics: reject I/O to offline device
  nvmet: fix a spelling mistake "incuding" -> "including" in Kconfig
  nvmet: make sure discovery change log event is protected
  nvmet: remove unused ctrl->cqs
  nvme-pci: don't allocate unused I/O queues
  nvme-pci: drop min() from nr_io_queues assignment
  nvmet: use inline bio for passthru fast path
  nvmet: use blk_rq_bio_prep instead of blk_rq_append_bio
  nvmet: remove op_flags for passthru commands
  nvme: split nvme_alloc_request()
  block: move blk_rq_bio_prep() to linux/blk-mq.h
  nvmet: add passthru io timeout value attr
  nvmet: add passthru admin timeout value attr
  nvme: use consistent macro name for timeout
  ...

3 years agonvme: export zoned namespaces without Zone Append support read-only
Javier González [Tue, 1 Dec 2020 12:02:21 +0000 (13:02 +0100)]
nvme: export zoned namespaces without Zone Append support read-only

Allow ZNS NVMe SSDs to present a read-only namespace when append is not
supported, instead of rejecting the namespace directly.

This allows (i) the namespace to be used in read-only mode, which is not
a problem as the append command only affects the write path, and (ii) to
use standard management tools such as nvme-cli to choose a different
format or firmware slot that is compatible with the Linux zoned block
device.

Signed-off-by: Javier González <javier.gonz@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: rename bdev operations
Javier González [Tue, 1 Dec 2020 12:56:09 +0000 (13:56 +0100)]
nvme: rename bdev operations

Remane block device operations in preparation to add char device file
operations.

Signed-off-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: rename controller base dev_t char device
Javier González [Tue, 1 Dec 2020 12:56:08 +0000 (13:56 +0100)]
nvme: rename controller base dev_t char device

Rename controller base dev_t char device in preparation for adding a
namespace char device.

Signed-off-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: remove unnecessary return values
Javier González [Tue, 1 Dec 2020 12:56:07 +0000 (13:56 +0100)]
nvme: remove unnecessary return values

Cleanup unnecessary ret values that are not checked or used in
nvme_alloc_ns().

Signed-off-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: print a warning for when listing active namespaces fails
Minwoo Im [Mon, 30 Nov 2020 12:47:47 +0000 (21:47 +0900)]
nvme: print a warning for when listing active namespaces fails

During the scan_work, an Identify command is issued to figure out which
namespaces are active.  If this command fails, the nvme driver falls back
to scanning namespaces sequentially.  In this situation, we don't see
any warnings and don't even know whether list-ns command has been failed
or not easiliy.

Printa warning when the Identify command executin fail:

[    1.108399] nvme nvme0: Identify NS List failed (status=0x400b)
[    1.109583] nvme0n1: detected capacity change from 0 to 1048576
[    1.112186] nvme nvme0: Identify Descriptors failed (nsid=2, status=0x4002)
[    1.113929] nvme nvme0: Identify Descriptors failed (nsid=3, status=0x4002)
[    1.116537] nvme nvme0: Identify Descriptors failed (nsid=4, status=0x4002)
...

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: improve an error message on Identify failure
Minwoo Im [Mon, 30 Nov 2020 12:47:46 +0000 (21:47 +0900)]
nvme: improve an error message on Identify failure

Add the namespace ID to the error message when the Identify command used
to retrieve the Namespace Identification Descriptor list fails.

This avoids rather useless and duplicative messages like the following:
[    1.321031] nvme nvme0: Identify Descriptors failed (16386)
[    1.321948] nvme nvme0: Identify Descriptors failed (16386)
[    1.322872] nvme nvme0: Identify Descriptors failed (16386)
[    1.323775] nvme nvme0: Identify Descriptors failed (16386)
[    1.324687] nvme nvme0: Identify Descriptors failed (16386)
...

Also, print the nvme status code in hexadecimal rather than decimal
format rather for better readability.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme-fabrics: reject I/O to offline device
Victor Gladkov [Tue, 24 Nov 2020 18:34:59 +0000 (18:34 +0000)]
nvme-fabrics: reject I/O to offline device

Commands get stuck while Host NVMe-oF controller is in reconnect state.
The controller enters into reconnect state when it loses connection with
the target.  It tries to reconnect every 10 seconds (default) until
a successful reconnect or until the reconnect time-out is reached.
The default reconnect time out is 10 minutes.

Applications are expecting commands to complete with success or error
within a certain timeout (30 seconds by default).  The NVMe host is
enforcing that timeout while it is connected, but during reconnect the
timeout is not enforced and commands may get stuck for a long period or
even forever.

To fix this long delay due to the default timeout, introduce new
"fast_io_fail_tmo" session parameter.  The timeout is measured in seconds
from the controller reconnect and any command beyond that timeout is
rejected.  The new parameter value may be passed during 'connect'.
The default value of -1 means no timeout (similar to current behavior).

Signed-off-by: Victor Gladkov <victor.gladkov@kioxia.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet: fix a spelling mistake "incuding" -> "including" in Kconfig
Colin Ian King [Thu, 26 Nov 2020 22:40:29 +0000 (22:40 +0000)]
nvmet: fix a spelling mistake "incuding" -> "including" in Kconfig

There is a spelling mistake in the Kconfig help text. Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet: make sure discovery change log event is protected
Max Gurtovoy [Wed, 25 Nov 2020 12:27:36 +0000 (12:27 +0000)]
nvmet: make sure discovery change log event is protected

Generation counter is protected by nvmet_config_sem. Make sure the
callers that call functions that might change it, are calling it
properly.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet: remove unused ctrl->cqs
Amit [Sun, 15 Nov 2020 12:19:51 +0000 (14:19 +0200)]
nvmet: remove unused ctrl->cqs

remove unused cqs from nvmet_ctrl struct
this will reduce the allocated memory.

Signed-off-by: Amit <amit.engel@dell.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme-pci: don't allocate unused I/O queues
Niklas Schnelle [Thu, 12 Nov 2020 08:23:02 +0000 (09:23 +0100)]
nvme-pci: don't allocate unused I/O queues

currently the NVME_QUIRK_SHARED_TAGS quirk for Apple devices is handled
during the assignment of nr_io_queues in nvme_setup_io_queues().
This however means that for these devices nvme_max_io_queues() will
actually not return the supported maximum which is confusing and
unexpected and also means that in nvme_probe() we are allocating
for I/O queues that will never be used.
Fix this by moving the quirk handling into nvme_max_io_queues().

Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme-pci: drop min() from nr_io_queues assignment
Niklas Schnelle [Thu, 12 Nov 2020 08:23:01 +0000 (09:23 +0100)]
nvme-pci: drop min() from nr_io_queues assignment

in nvme_setup_io_queues() the number of I/O queues is set to either 1 in
case of a quirky Apple device or to the min of nvme_max_io_queues() or
dev->nr_allocated_queues - 1.
This is unnecessarily complicated as dev->nr_allocated_queues is only
assigned once and is nvme_max_io_queues() + 1.

Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet: use inline bio for passthru fast path
Chaitanya Kulkarni [Tue, 10 Nov 2020 02:24:05 +0000 (18:24 -0800)]
nvmet: use inline bio for passthru fast path

In nvmet_passthru_execute_cmd() which is a high frequency function
it uses bio_alloc() which leads to memory allocation from the fs pool
for each I/O.

For NVMeoF nvmet_req we already have inline_bvec allocated as a part of
request allocation that can be used with preallocated bio when we
already know the size of request before bio allocation with bio_alloc(),
which we already do.

Introduce a bio member for the nvmet_req passthru anon union. In the
fast path, check if we can get away with inline bvec and bio from
nvmet_req with bio_init() call before actually allocating from the
bio_alloc().

This will be useful to avoid any new memory allocation under high
memory pressure situation and get rid of any extra work of
allocation (bio_alloc()) vs initialization (bio_init()) when
transfer len is < NVMET_MAX_INLINE_DATA_LEN that user can configure at
compile time.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet: use blk_rq_bio_prep instead of blk_rq_append_bio
Chaitanya Kulkarni [Tue, 10 Nov 2020 02:24:04 +0000 (18:24 -0800)]
nvmet: use blk_rq_bio_prep instead of blk_rq_append_bio

The function blk_rq_append_bio() is a genereric API written for all
types driver (having bounce buffers) and different context (where
request is already having a bio i.e. rq->bio != NULL).

It does mainly three things: calculating the segments, bounce queue and
if req->bio == NULL call blk_rq_bio_prep() or handle low level merge()
case.

The NVMe PCIe and fabrics transports currently does not use queue
bounce mechanism. In order to find this for each request processing
in the passthru blk_rq_append_bio() does extra work in the fast path
for each request.

When I ran I/Os with different block sizes on the passthru controller
I found that we can reuse the req->sg_cnt instead of iterating over the
bvecs to find out nr_segs in blk_rq_append_bio(). This calculation in
blk_rq_append_bio() is a duplication of work given that we have the
value in req->sg_cnt. (correct me here if I'm wrong).

With NVMe passthru request based driver we allocate fresh request each
time, so every call to blk_rq_append_bio() rq->bio will be NULL i.e.
we don't really need the second condition in the blk_rq_append_bio()
and the resulting error condition in the caller of blk_rq_append_bio().

So for NVMeOF passthru driver recalculating the segments, bounce check
and ll_back_merge code is not needed such that we can get away with the
minimal version of the blk_rq_append_bio() which removes the error check
in the fast path along with extra variable in nvmet_passthru_map_sg().

This patch updates the nvmet_passthru_map_sg() such that it does only
appending the bio to the request in the context of the NVMeOF Passthru
driver. Following are perf numbers :-

With current implementation (blk_rq_append_bio()) :-
----------------------------------------------------
+    5.80%     0.02%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
+    5.44%     0.01%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
+    4.88%     0.00%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
+    5.44%     0.01%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
+    4.86%     0.01%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
+    5.17%     0.00%  kworker/0:2-eve  [nvmet]  [k] nvmet_passthru_execute_cmd

With this patch using blk_rq_bio_prep() :-
----------------------------------------------------
+    3.14%     0.02%  kworker/0:2-eve  [nvmet]  [k] nvmet_passthru_execute_cmd
+    3.26%     0.01%  kworker/0:2-eve  [nvmet]  [k] nvmet_passthru_execute_cmd
+    5.37%     0.01%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
+    5.18%     0.02%  kworker/0:2-eve  [nvmet]  [k] nvmet_passthru_execute_cmd
+    4.84%     0.02%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
+    4.87%     0.01%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet: remove op_flags for passthru commands
Chaitanya Kulkarni [Tue, 10 Nov 2020 02:24:02 +0000 (18:24 -0800)]
nvmet: remove op_flags for passthru commands

For passthru commands setting op_flags has no meaning. Remove the code
that sets the op flags in nvmet_passthru_map_sg().

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: split nvme_alloc_request()
Chaitanya Kulkarni [Tue, 10 Nov 2020 02:24:00 +0000 (18:24 -0800)]
nvme: split nvme_alloc_request()

Right now nvme_alloc_request() allocates a request from block layer
based on the value of the qid. When qid set to NVME_QID_ANY it used
blk_mq_alloc_request() else blk_mq_alloc_request_hctx().

The function nvme_alloc_request() is called from different context, The
only place where it uses non NVME_QID_ANY value is for fabrics connect
commands :-

nvme_submit_sync_cmd() NVME_QID_ANY
nvme_features() NVME_QID_ANY
nvme_sec_submit() NVME_QID_ANY
nvmf_reg_read32() NVME_QID_ANY
nvmf_reg_read64() NVME_QID_ANY
nvmf_reg_write32() NVME_QID_ANY
nvmf_connect_admin_queue() NVME_QID_ANY
nvme_submit_user_cmd() NVME_QID_ANY
nvme_alloc_request()
nvme_keep_alive() NVME_QID_ANY
nvme_alloc_request()
nvme_timeout() NVME_QID_ANY
nvme_alloc_request()
nvme_delete_queue() NVME_QID_ANY
nvme_alloc_request()
nvmet_passthru_execute_cmd() NVME_QID_ANY
nvme_alloc_request()
nvmf_connect_io_queue()  QID
__nvme_submit_sync_cmd()
nvme_alloc_request()

With passthru nvme_alloc_request() now falls into the I/O fast path such
that blk_mq_alloc_request_hctx() is never gets called and that adds
additional branch check in fast path.

Split the nvme_alloc_request() into nvme_alloc_request() and
nvme_alloc_request_qid().

Replace each call of the nvme_alloc_request() with NVME_QID_ANY param
with a call to newly added nvme_alloc_request() without NVME_QID_ANY.

Replace a call to nvme_alloc_request() with QID param with a call to
newly added nvme_alloc_request() and nvme_alloc_request_qid()
based on the qid value set in the __nvme_submit_sync_cmd().

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agoblock: move blk_rq_bio_prep() to linux/blk-mq.h
Chaitanya Kulkarni [Tue, 10 Nov 2020 02:24:03 +0000 (18:24 -0800)]
block: move blk_rq_bio_prep() to linux/blk-mq.h

This is a preparation patch to have minimal block layer request bio
append functionality in the context of the NVMeOF Passthru driver which
falls in the fast path and doesn't need calls from blk_rq_append_bio().

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet: add passthru io timeout value attr
Chaitanya Kulkarni [Tue, 10 Nov 2020 00:33:44 +0000 (16:33 -0800)]
nvmet: add passthru io timeout value attr

NVMeOF controller in the passsthru mode is capable of handling wide set
of I/O commands including vender specific passhtru io comands.

The vendor specific I/O commands are used to read the large drive
logs and can take longer than default NVMe commands, i.e. for
passthru requests the timeout value may differ from the passthru
controller's default timeout values (nvme-core:io_timeout).

Add a configfs attribute so that user can set the io timeout values.
In case if this configfs value is not set nvme_alloc_request() will set
the NVME_IO_TIMEOUT value when request queuedata is NULL.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet: add passthru admin timeout value attr
Chaitanya Kulkarni [Tue, 10 Nov 2020 00:33:43 +0000 (16:33 -0800)]
nvmet: add passthru admin timeout value attr

NVMeOF controller in the passsthru mode is capable of handling wide set
of admin commands including vender specific passhtru admin comands.

The vendor specific admin commands are used to read the large drive
logs and can take longer than default NVMe commands, i.e. for
passthru requests the timeout value may differ from the passthru
controller's default timeout values (nvme-core:admin_timeout).

Add a configfs attribute so that user can set the admin timeout values.
In case if this configfs value is not set nvme_alloc_request() will set
the ADMIN_TIMEOUT value when request queuedata is NULL.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: use consistent macro name for timeout
Chaitanya Kulkarni [Tue, 10 Nov 2020 00:33:45 +0000 (16:33 -0800)]
nvme: use consistent macro name for timeout

This is purely a clenaup patch, add prefix NVME to the ADMIN_TIMEOUT to
make consistent with NVME_IO_TIMEOUT.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: centralize setting the timeout in nvme_alloc_request
Chaitanya Kulkarni [Tue, 10 Nov 2020 00:33:42 +0000 (16:33 -0800)]
nvme: centralize setting the timeout in nvme_alloc_request

The function nvme_alloc_request() is called from different context
(I/O and Admin queue) where callers do not consider the I/O timeout when
called from I/O queue context.

Update nvme_alloc_request() to set the default I/O and Admin timeout
value based on whether the queuedata is set or not.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: simplify nvme_req_qid()
Baolin Wang [Tue, 27 Oct 2020 08:15:16 +0000 (16:15 +0800)]
nvme: simplify nvme_req_qid()

Use the request's '->mq_hctx->queue_num' directly to simplify the
nvme_req_qid() function.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme-fcloop: add sysfs attribute to inject command drop
James Smart [Fri, 16 Oct 2020 21:28:38 +0000 (14:28 -0700)]
nvme-fcloop: add sysfs attribute to inject command drop

Add sysfs attribute to specify parameters for dropping a command.  The
attribute takes a string of:

  <opcode>:<starting a what instance>:<number of times>

Opcode is formatted as lower 8 bits are opcode.  If a fabrics opcode, a
bit above bits 7:0 will be set.

Once set, each sqe is looked at. If the opcode matches the running
instance count is updated. If the instance count is in the range of where
to drop (based on starting and # of times), then drop the command by not
passing it to the target layer.

Signed-off-by: James Smart <james.smart@broadcom.com>
3 years agoMerge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md...
Jens Axboe [Mon, 30 Nov 2020 22:52:19 +0000 (15:52 -0700)]
Merge branch 'md-next' of https://git./linux/kernel/git/song/md into for-5.11/drivers

Pull MD changes from Song:

"Summary:
 1. Fix race condition in md_ioctl(), by Dae R. Jeong;
 2. Initialize read_slot properly for raid10, by Kevin Vigor;
 3. Code cleanup, by Pankaj Gupta;
 4. md-cluster resync/reshape fix, by Zhao Heming."

* 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md/cluster: fix deadlock when node is doing resync job
  md/cluster: block reshape with remote resync job
  md: use current request time as base for ktime comparisons
  md: add comments in md_flush_request()
  md: improve variable names in md_flush_request()
  md/raid10: initialize r10_bio->read_slot before use.
  md: fix a warning caused by a race between concurrent md_ioctl()s

3 years agomd/cluster: fix deadlock when node is doing resync job
Zhao Heming [Thu, 19 Nov 2020 11:41:34 +0000 (19:41 +0800)]
md/cluster: fix deadlock when node is doing resync job

md-cluster uses MD_CLUSTER_SEND_LOCK to make node can exclusively send msg.
During sending msg, node can concurrently receive msg from another node.
When node does resync job, grab token_lockres:EX may trigger a deadlock:
```
nodeA                       nodeB
--------------------     --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                         b.
                         md_do_sync
                          resync_info_update
                            send RESYNCING
                             + set MD_CLUSTER_SEND_LOCK
                             + wait for holding token_lockres:EX

                         c.
                         mdadm /dev/md0 --remove /dev/sdg
                          + held reconfig_mutex
                          + send REMOVE
                             + wait_event(MD_CLUSTER_SEND_LOCK)

                         d.
                         recv_daemon //METADATA_UPDATED from A
                          process_metadata_update
                           + (mddev_trylock(mddev) ||
                              MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                             //this time, both return false forever
```
Explaination:
a. A send METADATA_UPDATED
   This will block another node to send msg

b. B does sync jobs, which will send RESYNCING at intervals.
   This will be block for holding token_lockres:EX lock.

c. B do "mdadm --remove", which will send REMOVE.
   This will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.

d. B recv METADATA_UPDATED msg, which send from A in step <a>.
   This will be blocked by step <c>: holding mddev lock, it makes
   wait_event can't hold mddev lock. (btw,
   MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD keep ZERO in this scenario.)

There is a similar deadlock in commit 0ba959774e93
("md-cluster: use sync way to handle METADATA_UPDATED msg")
In that commit, step c is "update sb". This patch step c is
"mdadm --remove".

For fixing this issue, we can refer the solution of function:
metadata_update_start. Which does the same grab lock_token action.
lock_comm can use the same steps to avoid deadlock. By moving
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token to lock_comm.
It enlarge a little bit window of MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
but it is safe & can break deadlock.

Repro steps (I only triggered 3 times with hundreds tests):

two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB.
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
 --bitmap-chunk=1M
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```

test script will hung when executing "mdadm --remove".

```
 # dump stacks by "echo t > /proc/sysrq-trigger"
md0_cluster_rec D    0  5329      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? _cond_resched+0x2d/0x40
 ? schedule+0x4a/0xb0
 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
 ? wait_woken+0x80/0x80
 ? process_recvd_msg+0x113/0x1d0 [md_cluster]
 ? recv_daemon+0x9e/0x120 [md_cluster]
 ? md_thread+0x94/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40

mdadm           D    0  5423      1 0x00004004
Call Trace:
 __schedule+0x1f6/0x560
 ? __schedule+0x1fe/0x560
 ? schedule+0x4a/0xb0
 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? remove_disk+0x4f/0x90 [md_cluster]
 ? hot_remove_disk+0xb1/0x1b0 [md_mod]
 ? md_ioctl+0x50c/0xba0 [md_mod]
 ? wait_woken+0x80/0x80
 ? blkdev_ioctl+0xa2/0x2a0
 ? block_ioctl+0x39/0x40
 ? ksys_ioctl+0x82/0xc0
 ? __x64_sys_ioctl+0x16/0x20
 ? do_syscall_64+0x5f/0x150
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

md0_resync      D    0  5425      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? schedule+0x4a/0xb0
 ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? lock_token+0x2d/0x90 [md_cluster]
 ? resync_info_update+0x95/0x100 [md_cluster]
 ? raid1_sync_request+0x7d3/0xa40 [raid1]
 ? md_do_sync.cold+0x737/0xc8f [md_mod]
 ? md_thread+0x94/0x160 [md_mod]
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40
```

At last, thanks for Xiao's solution.

Cc: stable@vger.kernel.org
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Suggested-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
3 years agomd/cluster: block reshape with remote resync job
Zhao Heming [Thu, 19 Nov 2020 11:41:33 +0000 (19:41 +0800)]
md/cluster: block reshape with remote resync job

Reshape request should be blocked with ongoing resync job. In cluster
env, a node can start resync job even if the resync cmd isn't executed
on it, e.g., user executes "mdadm --grow" on node A, sometimes node B
will start resync job. However, current update_raid_disks() only check
local recovery status, which is incomplete. As a result, we see user will
execute "mdadm --grow" successfully on local, while the remote node deny
to do reshape job when it doing resync job. The inconsistent handling
cause array enter unexpected status. If user doesn't observe this issue
and continue executing mdadm cmd, the array doesn't work at last.

Fix this issue by blocking reshape request. When node executes "--grow"
and detects ongoing resync, it should stop and report error to user.

The following script reproduces the issue with ~100% probability.
(two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB)
```
 # on node1, node2 is the remote node.
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```

Cc: stable@vger.kernel.org
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
3 years agomd: use current request time as base for ktime comparisons
Pankaj Gupta [Wed, 11 Nov 2020 05:16:58 +0000 (06:16 +0100)]
md: use current request time as base for ktime comparisons

Request coalescing logic uses 'prev_flush_start' as base to
compare the current request start time. 'prev_flush_start' is
updated in other context.

This patch changes this by using ktime comparison base to
'req_start' for better readability of code.

Signed-off-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
3 years agomd: add comments in md_flush_request()
Pankaj Gupta [Wed, 11 Nov 2020 05:16:57 +0000 (06:16 +0100)]
md: add comments in md_flush_request()

Request coalescing logic is dependent on flush time update in other
context. This patch adds comments to understand the code flow better.

Signed-off-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
3 years agomd: improve variable names in md_flush_request()
Pankaj Gupta [Wed, 11 Nov 2020 05:16:56 +0000 (06:16 +0100)]
md: improve variable names in md_flush_request()

This patch improves readability by using better variable names
in flush request coalescing logic.

Signed-off-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
3 years agomd/raid10: initialize r10_bio->read_slot before use.
Kevin Vigor [Fri, 6 Nov 2020 22:20:34 +0000 (14:20 -0800)]
md/raid10: initialize r10_bio->read_slot before use.

In __make_request() a new r10bio is allocated and passed to
raid10_read_request(). The read_slot member of the bio is not
initialized, and the raid10_read_request() uses it to index an
array. This leads to occasional panics.

Fix by initializing the field to invalid value and checking for
valid value in raid10_read_request().

Cc: stable@vger.kernel.org
Signed-off-by: Kevin Vigor <kvigor@gmail.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
3 years agomd: fix a warning caused by a race between concurrent md_ioctl()s
Dae R. Jeong [Thu, 22 Oct 2020 01:21:28 +0000 (10:21 +0900)]
md: fix a warning caused by a race between concurrent md_ioctl()s

Syzkaller reports a warning as belows.
WARNING: CPU: 0 PID: 9647 at drivers/md/md.c:7169
...
Call Trace:
...
RIP: 0010:md_ioctl+0x4017/0x5980 drivers/md/md.c:7169
RSP: 0018:ffff888096027950 EFLAGS: 00010293
RAX: ffff88809322c380 RBX: 0000000000000932 RCX: ffffffff84e266f2
RDX: 0000000000000000 RSI: ffffffff84e299f7 RDI: 0000000000000007
RBP: ffff888096027bc0 R08: ffff88809322c380 R09: ffffed101341a482
R10: ffff888096027940 R11: ffff88809a0d240f R12: 0000000000000932
R13: ffff8880a2c14100 R14: ffff88809a0d2268 R15: ffff88809a0d2408
 __blkdev_driver_ioctl block/ioctl.c:304 [inline]
 blkdev_ioctl+0xece/0x1c10 block/ioctl.c:606
 block_ioctl+0xee/0x130 fs/block_dev.c:1930
 vfs_ioctl fs/ioctl.c:46 [inline]
 file_ioctl fs/ioctl.c:509 [inline]
 do_vfs_ioctl+0xd5f/0x1380 fs/ioctl.c:696
 ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
 __do_sys_ioctl fs/ioctl.c:720 [inline]
 __se_sys_ioctl fs/ioctl.c:718 [inline]
 __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

This is caused by a race between two concurrenct md_ioctl()s closing
the array.
CPU1 (md_ioctl())                   CPU2 (md_ioctl())
------                              ------
set_bit(MD_CLOSING, &mddev->flags);
did_set_md_closing = true;
                                    WARN_ON_ONCE(test_bit(MD_CLOSING,
                                            &mddev->flags));
if(did_set_md_closing)
    clear_bit(MD_CLOSING, &mddev->flags);

Fix the warning by returning immediately if the MD_CLOSING bit is set
in &mddev->flags which indicates that the array is being closed.

Fixes: 065e519e71b2 ("md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop")
Reported-by: syzbot+1e46a0864c1a6e9bd3d8@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Dae R. Jeong <dae.r.jeong@kaist.ac.kr>
Signed-off-by: Song Liu <songliubraving@fb.com>
3 years agos390/dasd: Process FCES path event notification
Jan Höppner [Thu, 8 Oct 2020 13:13:36 +0000 (15:13 +0200)]
s390/dasd: Process FCES path event notification

If the Fibre Channel Endpoint-Security status of a path changes, a
corresponding path event is received from the CIO layer.

Process this event by re-reading the FCES information.

As the information is retrieved for all paths on a single CU in one
call, the internal status can also be updated for all paths and no
processing per path is necessary.

Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agos390/dasd: Prepare for additional path event handling
Jan Höppner [Thu, 8 Oct 2020 13:13:35 +0000 (15:13 +0200)]
s390/dasd: Prepare for additional path event handling

As more path events need to be handled for ECKD the current path
verification infrastructure can be reused. Rename all path verifcation
code to fit the more broadly based task of path event handling and put
the path verification in a new separate function.

Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agos390/dasd: Display FC Endpoint Security information via sysfs
Jan Höppner [Thu, 8 Oct 2020 13:13:34 +0000 (15:13 +0200)]
s390/dasd: Display FC Endpoint Security information via sysfs

Add a new sysfs attribute (fc_security) per device and per operational
channel path. The information of the current FC Endpoint Security state
is received through the CIO layer.

The state of the FC Endpoint Security can be either "Unsupported",
"Authentication", or "Encryption".

For example:
$ cat /sys/bus/ccw/devices/0.0.c600/fc_security
Encryption

If any of the operational paths is in a state different from all
others, the device sysfs attribute will display the additional state
"Inconsistent".

The sysfs attributes per paths are organised in a new directory called
"paths_info" with subdirectories for each path.

/sys/bus/ccw/devices/0.0.c600/paths_info/
├── 0.38
│   └── fc_security
├── 0.39
│   └── fc_security
├── 0.3a
│   └── fc_security
└── 0.3b
    └── fc_security

Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agos390/dasd: Fix operational path inconsistency
Jan Höppner [Thu, 8 Oct 2020 13:13:33 +0000 (15:13 +0200)]
s390/dasd: Fix operational path inconsistency

During online processing and setting up a DASD device, the configuration
data for operational paths is read and validated two times
(dasd_eckd_read_conf()). The first time to provide information that are
necessary for the LCU setup. A second time after the LCU setup as a
device might report different configuration data then.

When the configuration setup for each operational path is being
validated, an initial call to dasd_eckd_clear_conf_data() is issued.
This call wipes all previously available configuration data and path
information for each path.
However, the operational path mask is not updated during this process.

As a result, the stored operational path mask might no longer correspond
to the operational paths mask reported by the CIO layer, as several
paths might be gone between the two dasd_eckd_read_conf() calls.

This inconsistency leads to more severe issues in later path handling
changes. Fix this by removing the channel paths from the operational
path mask during the dasd_eckd_clear_conf_data() call.

Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agos390/dasd: Store path configuration data during path handling
Jan Höppner [Thu, 8 Oct 2020 13:13:32 +0000 (15:13 +0200)]
s390/dasd: Store path configuration data during path handling

Currently, the configuration data for a path is retrieved during a path
verification and used only temporarily. If a path is newly added to the
I/O setup after a boot, no configuration data will be stored for this
particular path.
However, this data is required for later use and should be present for
a valid I/O path anyway. Store this data during the path verification so
that newly added paths can provide all information necessary.

[sth@linux.ibm.com: fix conf_data memleak]

Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agos390/dasd: Move duplicate code to separate function
Jan Höppner [Thu, 8 Oct 2020 13:13:31 +0000 (15:13 +0200)]
s390/dasd: Move duplicate code to separate function

For storing retrieved path information both the if and else block in
dasd_eckd_read_conf() use the same code. To avoid duplicate code this
should be done after the if/else block. To further increase readability,
move the code to a new function, dasd_eckd_store_conf_data().

Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agos390/dasd: Remove unused parameter from dasd_generic_probe()
Jan Höppner [Thu, 8 Oct 2020 13:13:30 +0000 (15:13 +0200)]
s390/dasd: Remove unused parameter from dasd_generic_probe()

The discipline argument in dasd_generic_probe() isn't used and there is
no history how it was used in the past. Remove it.

Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agos390/cio: Add support for FCES status notification
Vineeth Vijayan [Thu, 8 Oct 2020 13:13:29 +0000 (15:13 +0200)]
s390/cio: Add support for FCES status notification

Fibre Channel Endpoint-Security event is received as an sei:nt0 type
in the CIO layer. This information needs to be shared with the
CCW device drivers using the path_events callback.

Co-developed-by: Sebastian Ott <sebott@linux.ibm.com>
Signed-off-by: Vineeth Vijayan <vneethv@linux.ibm.com>
Signed-off-by: Sebastian Ott <sebott@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agos390/cio: Provide Endpoint-Security Mode per CU
Vineeth Vijayan [Thu, 8 Oct 2020 13:13:28 +0000 (15:13 +0200)]
s390/cio: Provide Endpoint-Security Mode per CU

Add an interface in the CIO layer to retrieve the information about the
Endpoint-Security Mode (ESM) of the specified CU. The ESM values are
defined as 0-None, 1-Authenticated or 2, 3-Encrypted.

[vneethv@linux.ibm.com: cleaned-up and modified description]

Signed-off-by: Sebastian Ott <sebott@linux.ibm.com>
Signed-off-by: Vineeth Vijayan <vneethv@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agos390/cio: Export information about Endpoint-Security Capability
Sebastian Ott [Thu, 8 Oct 2020 13:13:27 +0000 (15:13 +0200)]
s390/cio: Export information about Endpoint-Security Capability

Add a new sysfs attribute 'esc' per chpid. This new attribute exports
the Endpoint-Security-Capability byte of channel-path description block,
which could be 0-None, 1-Authentication, 2 and 3-Encryption.

For example:
$ cat /sys/devices/css0/chp0.34/esc
0

[vneethv@linux.ibm.com: cleaned-up & modified description]

Signed-off-by: Sebastian Ott <sebott@linux.ibm.com>
Signed-off-by: Vineeth Vijayan <vneethv@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: fix the kerneldoc comment for __register_blkdev
Christoph Hellwig [Sat, 14 Nov 2020 17:08:21 +0000 (18:08 +0100)]
block: fix the kerneldoc comment for __register_blkdev

Switch the comment to talk about __register_blkdev instead of
register_blkdev and document the new probe parameter.

Fixes: 3da1a61e7046 ("block: add an optional probe callback to major_names")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: switch gendisk lookup to a simple xarray
Christoph Hellwig [Thu, 29 Oct 2020 14:58:41 +0000 (15:58 +0100)]
block: switch gendisk lookup to a simple xarray

Now that bdev_map is only used for finding gendisks, we can use
a simple xarray instead of the regions tracking structure for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoz2ram: use separate gendisk for the different modes
Christoph Hellwig [Thu, 29 Oct 2020 14:58:40 +0000 (15:58 +0100)]
z2ram: use separate gendisk for the different modes

Use separate gendisks (which share a tag_set) for the different operating
modes instead of redirecting the gendisk lookup using a probe callback.
This avoids potential problems with aliased block_device instances and
will eventually allow for removing the blk_register_region framework.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoz2ram: reindent
Christoph Hellwig [Thu, 29 Oct 2020 14:58:39 +0000 (15:58 +0100)]
z2ram: reindent

reindent the driver using Lident as the code style was far away from
normal Linux code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoataflop: use a separate gendisk for each media format
Christoph Hellwig [Thu, 29 Oct 2020 14:58:38 +0000 (15:58 +0100)]
ataflop: use a separate gendisk for each media format

The Atari floppy driver usually autodetects the media when used with the
ormal /dev/fd? devices, which also are the only nodes created by udev.
But it also supports various aliases that force a given media format.
That is currently supported using the blk_register_region framework
which finds the floppy gendisk even for a 'mismatched' dev_t.  The
problem with this (besides the code complexity) is that it creates
multiple struct block_device instances for the whole device of a
single gendisk, which can lead to interesting issues in code not
aware of that fact.

To fix this just create a separate gendisk for each of the aliases
if they are accessed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoamiflop: use separate gendisks for Amiga vs MS-DOS mode
Christoph Hellwig [Thu, 29 Oct 2020 14:58:37 +0000 (15:58 +0100)]
amiflop: use separate gendisks for Amiga vs MS-DOS mode

Use separate gendisks (which share a tag_set) for the native Amgiga vs
the MS-DOS mode instead of redirecting the gendisk lookup using a probe
callback.  This avoids potential problems with aliased block_device
instances and will eventually allow for removing the blk_register_region
framework.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agofloppy: use a separate gendisk for each media format
Christoph Hellwig [Thu, 29 Oct 2020 14:58:36 +0000 (15:58 +0100)]
floppy: use a separate gendisk for each media format

The floppy driver usually autodetects the media when used with the
normal /dev/fd? devices, which also are the only nodes created by udev.
But it also supports various aliases that force a given media format.
That is currently supported using the blk_register_region framework
which finds the floppy gendisk even for a 'mismatched' dev_t.  The
problem with this (besides the code complexity) is that it creates
multiple struct block_device instances for the whole device of a
single gendisk, which can lead to interesting issues in code not
aware of that fact.

To fix this just create a separate gendisk for each of the aliases
if they are accessed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoide: switch to __register_blkdev for command set probing
Christoph Hellwig [Thu, 29 Oct 2020 14:58:35 +0000 (15:58 +0100)]
ide: switch to __register_blkdev for command set probing

ide is the last user of the blk_register_region framework except for the
tracking of allocated gendisk.  Switch to __register_blkdev, even if that
doesn't allow us to trivially find out which command set to probe for.
That means we now always request all modules when a user tries to access
an unclaimed ide device node, but except for a few potentially loaded
modules for a fringe use case of a deprecated and soon to be removed
driver that doesn't make a difference.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomd: use __register_blkdev to allocate devices on demand
Christoph Hellwig [Thu, 29 Oct 2020 14:58:34 +0000 (15:58 +0100)]
md: use __register_blkdev to allocate devices on demand

Use the simpler mechanism attached to major_name to allocate a md device
when a currently unregistered minor is accessed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoloop: use __register_blkdev to allocate devices on demand
Christoph Hellwig [Thu, 29 Oct 2020 14:58:33 +0000 (15:58 +0100)]
loop: use __register_blkdev to allocate devices on demand

Use the simpler mechanism attached to major_name to allocate a brd device
when a currently unregistered minor is accessed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agobrd: use __register_blkdev to allocate devices on demand
Christoph Hellwig [Thu, 29 Oct 2020 14:58:32 +0000 (15:58 +0100)]
brd: use __register_blkdev to allocate devices on demand

Use the simpler mechanism attached to major_name to allocate a brd device
when a currently unregistered minor is accessed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agosd: use __register_blkdev to avoid a modprobe for an unregistered dev_t
Christoph Hellwig [Thu, 29 Oct 2020 14:58:31 +0000 (15:58 +0100)]
sd: use __register_blkdev to avoid a modprobe for an unregistered dev_t

Switch from using blk_register_region to the probe callback passed to
__register_blkdev to disable the request_module call for an unclaimed
dev_t in the SD majors.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoswim: don't call blk_register_region
Christoph Hellwig [Thu, 29 Oct 2020 14:58:30 +0000 (15:58 +0100)]
swim: don't call blk_register_region

The swim driver (unlike various other floppy drivers) doesn't have
magic device nodes for certain modes, and already registers a gendisk
for each of the floppies supported by a device.  Thus the region
registered is a no-op and can be removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoide: remove ide_{,un}register_region
Christoph Hellwig [Thu, 29 Oct 2020 14:58:29 +0000 (15:58 +0100)]
ide: remove ide_{,un}register_region

There is no need to ever register the fake gendisk used for ide-tape.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: add an optional probe callback to major_names
Christoph Hellwig [Thu, 29 Oct 2020 14:58:28 +0000 (15:58 +0100)]
block: add an optional probe callback to major_names

Add a callback to the major_names array that allows a driver to override
how to probe for dev_t that doesn't currently have a gendisk registered.
This will help separating the lookup of the gendisk by dev_t vs probe
action for a not currently registered dev_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: rework requesting modules for unclaimed devices
Christoph Hellwig [Thu, 29 Oct 2020 14:58:27 +0000 (15:58 +0100)]
block: rework requesting modules for unclaimed devices

Instead of reusing the ranges in bdev_map, add a new helper that is
called if no ranges was found.  This is a first step to unpeel and
eventually remove the complex ranges structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: split block_class_lock
Christoph Hellwig [Thu, 29 Oct 2020 14:58:26 +0000 (15:58 +0100)]
block: split block_class_lock

Split the block_class_lock mutex into one each to protect bdev_map
and major_names.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: open code kobj_map into in block/genhd.c
Christoph Hellwig [Thu, 29 Oct 2020 14:58:25 +0000 (15:58 +0100)]
block: open code kobj_map into in block/genhd.c

Copy and paste the kobj_map functionality in the block code in preparation
for completely rewriting it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: cleanup del_gendisk a bit
Christoph Hellwig [Thu, 29 Oct 2020 14:58:24 +0000 (15:58 +0100)]
block: cleanup del_gendisk a bit

Merge three hidden gendisk checks into one.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove __blkdev_driver_ioctl
Christoph Hellwig [Tue, 3 Nov 2020 10:00:18 +0000 (11:00 +0100)]
block: remove __blkdev_driver_ioctl

Just open code it in the few callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove set_device_ro
Christoph Hellwig [Tue, 3 Nov 2020 10:00:17 +0000 (11:00 +0100)]
block: remove set_device_ro

Fold set_device_ro into its only remaining caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoloop: use set_disk_ro
Christoph Hellwig [Tue, 3 Nov 2020 10:00:16 +0000 (11:00 +0100)]
loop: use set_disk_ro

Use set_disk_ro instead of set_device_ro to match all other block
drivers and to ensure all partitions mirror the read-only flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: don't call into the driver for BLKROSET
Christoph Hellwig [Tue, 3 Nov 2020 10:00:15 +0000 (11:00 +0100)]
block: don't call into the driver for BLKROSET

Now that all drivers that want to hook into setting or clearing the
read-only flag use the set_read_only method, this code can be removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agodasd: implement ->set_read_only to hook into BLKROSET processing
Christoph Hellwig [Tue, 3 Nov 2020 10:00:14 +0000 (11:00 +0100)]
dasd: implement ->set_read_only to hook into BLKROSET processing

Implement the ->set_read_only method instead of parsing the actual
ioctl command.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomd: implement ->set_read_only to hook into BLKROSET processing
Christoph Hellwig [Tue, 3 Nov 2020 10:00:13 +0000 (11:00 +0100)]
md: implement ->set_read_only to hook into BLKROSET processing

Implement the ->set_read_only method instead of parsing the actual
ioctl command.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agorbd: implement ->set_read_only to hook into BLKROSET processing
Christoph Hellwig [Tue, 3 Nov 2020 10:00:12 +0000 (11:00 +0100)]
rbd: implement ->set_read_only to hook into BLKROSET processing

Implement the ->set_read_only method instead of parsing the actual
ioctl command.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: add a new set_read_only method
Christoph Hellwig [Tue, 3 Nov 2020 10:00:11 +0000 (11:00 +0100)]
block: add a new set_read_only method

Add a new method to allow for driver-specific processing when setting or
clearing the block device read-only state.  This allows to replace the
cumbersome and error-prone override of the whole ioctl implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: don't call into the driver for BLKFLSBUF
Christoph Hellwig [Tue, 3 Nov 2020 10:00:10 +0000 (11:00 +0100)]
block: don't call into the driver for BLKFLSBUF

BLKFLSBUF is entirely contained in the block core, and there is no
good reason to give the driver a hook into processing it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomtd_blkdevs: don't override BLKFLSBUF
Christoph Hellwig [Tue, 3 Nov 2020 10:00:09 +0000 (11:00 +0100)]
mtd_blkdevs: don't override BLKFLSBUF

BLKFLSBUF is not supposed to actually send a flush command to the device,
but to tear down buffer cache structures.  Remove the mtd_blkdevs
implementation and just use the default semantics instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoLinux 5.10-rc4
Linus Torvalds [Mon, 16 Nov 2020 00:44:31 +0000 (16:44 -0800)]
Linux 5.10-rc4

3 years agoMerge tag 'drm-fixes-2020-11-16' of git://anongit.freedesktop.org/drm/drm
Linus Torvalds [Sun, 15 Nov 2020 21:07:36 +0000 (13:07 -0800)]
Merge tag 'drm-fixes-2020-11-16' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Dave Airlie:
 "Nouveau fixes:

   - atomic modesetting regression fix

   - ttm pre-nv50 fix

   - connector NULL ptr deref fix"

* tag 'drm-fixes-2020-11-16' of git://anongit.freedesktop.org/drm/drm:
  drm/nouveau/kms/nv50-: Use atomic encoder callbacks everywhere
  drm/nouveau/ttm: avoid using nouveau_drm.ttm.type_vram prior to nv50
  drm/nouveau/kms: Fix NULL pointer dereference in nouveau_connector_detect_depth

3 years agoMerge branch 'linux-5.10' of git://github.com/skeggsb/linux into drm-fixes
Dave Airlie [Sun, 15 Nov 2020 20:36:23 +0000 (06:36 +1000)]
Merge branch 'linux-5.10' of git://github.com/skeggsb/linux into drm-fixes

- atomic modesetting regression fix
- ttm pre-nv50 fix
- connector NULL ptr deref fix

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Ben Skeggs <skeggsb@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/CACAvsv5D9p78MNN0OxVeRZxN8LDqcadJEGUEFCgWJQ6+_rjPuw@mail.gmail.com
3 years agoMerge tag 'char-misc-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregk...
Linus Torvalds [Sun, 15 Nov 2020 18:15:17 +0000 (10:15 -0800)]
Merge tag 'char-misc-5.10-rc4' of git://git./linux/kernel/git/gregkh/char-misc

Pull char/misc driver fixes from Greg KH:
 "Here are some small char/misc/whatever driver fixes for 5.10-rc4.

  Nothing huge, lots of small fixes for reported issues:

   - habanalabs driver fixes

   - speakup driver fixes

   - uio driver fixes

   - virtio driver fix

   - other tiny driver fixes

  Full details are in the shortlog.

  All of these have been in linux-next for a full week with no reported
  issues"

* tag 'char-misc-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  uio: Fix use-after-free in uio_unregister_device()
  firmware: xilinx: fix out-of-bounds access
  nitro_enclaves: Fixup type and simplify logic of the poll mask setup
  speakup ttyio: Do not schedule() in ttyio_in_nowait
  speakup: Fix clearing selection in safe context
  speakup: Fix var_id_t values and thus keymap
  virtio: virtio_console: fix DMA memory allocation for rproc serial
  habanalabs/gaudi: mask WDT error in QMAN
  habanalabs/gaudi: move coresight mmu config
  habanalabs: fix kernel pointer type
  mei: protect mei_cl_mtu from null dereference

3 years agoMerge tag 'usb-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Linus Torvalds [Sun, 15 Nov 2020 18:02:41 +0000 (10:02 -0800)]
Merge tag 'usb-5.10-rc4' of git://git./linux/kernel/git/gregkh/usb

Pull USB and Thunderbolt fixes from Greg KH:
 "Here are some small Thunderbolt and USB driver fixes for 5.10-rc4 to
  solve some reported issues.

  Nothing huge in here, just small things:

   - thunderbolt memory leaks fixed and new device ids added

   - revert of problem patch for the musb driver

   - new quirks added for USB devices

   - typec power supply fixes to resolve much reported problems about
     charging notifications not working anymore

  All except the cdc-acm driver quirk addition have been in linux-next
  with no reported issues (the quirk patch was applied on Friday, and is
  self-contained)"

* tag 'usb-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  usb: cdc-acm: Add DISABLE_ECHO for Renesas USB Download mode
  MAINTAINERS: add usb raw gadget entry
  usb: typec: ucsi: Report power supply changes
  xhci: hisilicon: fix refercence leak in xhci_histb_probe
  Revert "usb: musb: convert to devm_platform_ioremap_resource_byname"
  thunderbolt: Add support for Intel Tiger Lake-H
  thunderbolt: Only configure USB4 wake for lane 0 adapters
  thunderbolt: Add uaccess dependency to debugfs interface
  thunderbolt: Fix memory leak if ida_simple_get() fails in enumerate_services()
  thunderbolt: Add the missed ida_simple_remove() in ring_request_msix()

3 years agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Sun, 15 Nov 2020 17:57:58 +0000 (09:57 -0800)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "Fixes for ARM and x86, the latter especially for old processors
  without two-dimensional paging (EPT/NPT)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  kvm: mmu: fix is_tdp_mmu_check when the TDP MMU is not in use
  KVM: SVM: Update cr3_lm_rsvd_bits for AMD SEV guests
  KVM: x86: Introduce cr3_lm_rsvd_bits in kvm_vcpu_arch
  KVM: x86: clflushopt should be treated as a no-op by emulation
  KVM: arm64: Handle SCXTNUM_ELx traps
  KVM: arm64: Unify trap handlers injecting an UNDEF
  KVM: arm64: Allow setting of ID_AA64PFR0_EL1.CSV2 from userspace

3 years agoMerge tag 'x86-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 15 Nov 2020 17:49:56 +0000 (09:49 -0800)]
Merge tag 'x86-urgent-2020-11-15' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
 "A small set of fixes for x86:

   - Cure the fallout from the MSI irqdomain overhaul which missed that
     the Intel IOMMU does not register virtual function devices and
     therefore never reaches the point where the MSI interrupt domain is
     assigned. This made the VF devices use the non-remapped MSI domain
     which is trapped by the IOMMU/remap unit

   - Remove an extra space in the SGI_UV architecture type procfs output
     for UV5

   - Remove a unused function which was missed when removing the UV BAU
     TLB shootdown handler"

* tag 'x86-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  iommu/vt-d: Cure VF irqdomain hickup
  x86/platform/uv: Fix copied UV5 output archtype
  x86/platform/uv: Drop last traces of uv_flush_tlb_others

3 years agoMerge tag 'perf-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 15 Nov 2020 17:46:36 +0000 (09:46 -0800)]
Merge tag 'perf-urgent-2020-11-15' of git://git./linux/kernel/git/tip/tip

Pull perf fixes from Thomas Gleixner:
 "A set of fixes for perf:

    - A set of commits which reduce the stack usage of various perf
      event handling functions which allocated large data structs on
      stack causing stack overflows in the worst case

    - Use the proper mechanism for detecting soft interrupts in the
      recursion protection

    - Make the resursion protection simpler and more robust

    - Simplify the scheduling of event groups to make the code more
      robust and prepare for fixing the issues vs. scheduling of
      exclusive event groups

    - Prevent event multiplexing and rotation for exclusive event groups

    - Correct the perf event attribute exclusive semantics to take
      pinned events, e.g. the PMU watchdog, into account

    - Make the anythread filtering conditional for Intel's generic PMU
      counters as it is not longer guaranteed to be supported on newer
      CPUs. Check the corresponding CPUID leaf to make sure

    - Fixup a duplicate initialization in an array which was probably
      caused by the usual 'copy & paste - forgot to edit' mishap"

* tag 'perf-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/uncore: Fix Add BW copypasta
  perf/x86/intel: Make anythread filter support conditional
  perf: Tweak perf_event_attr::exclusive semantics
  perf: Fix event multiplexing for exclusive groups
  perf: Simplify group_sched_in()
  perf: Simplify group_sched_out()
  perf/x86: Make dummy_iregs static
  perf/arch: Remove perf_sample_data::regs_user_copy
  perf: Optimize get_recursion_context()
  perf: Fix get_recursion_context()
  perf/x86: Reduce stack usage for x86_pmu::drain_pebs()
  perf: Reduce stack usage of perf_output_begin()

3 years agoMerge tag 'sched-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 15 Nov 2020 17:39:35 +0000 (09:39 -0800)]
Merge tag 'sched-urgent-2020-11-15' of git://git./linux/kernel/git/tip/tip

Pull scheduler fixes from Thomas Gleixner:
 "A set of scheduler fixes:

   - Address a load balancer regression by making the load balancer use
     the same logic as the wakeup path to spread tasks in the LLC domain

   - Prefer the CPU on which a task run last over the local CPU in the
     fast wakeup path for asymmetric CPU capacity systems to align with
     the symmetric case. This ensures more locality and prevents massive
     migration overhead on those asymetric systems

   - Fix a memory corruption bug in the scheduler debug code caused by
     handing a modified buffer pointer to kfree()"

* tag 'sched-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/debug: Fix memory corruption caused by multiple small reads of flags
  sched/fair: Prefer prev cpu in asymmetric wakeup path
  sched/fair: Ensure tasks spreading in LLC during LB

3 years agoMerge tag 'locking-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 15 Nov 2020 17:25:43 +0000 (09:25 -0800)]
Merge tag 'locking-urgent-2020-11-15' of git://git./linux/kernel/git/tip/tip

Pull locking fixes from Thomas Gleixner:
 "Two fixes for the locking subsystem:

   - Prevent an unconditional interrupt enable in a futex helper
     function which can be called from contexts which expect interrupts
     to stay disabled across the call

   - Don't modify lockdep chain keys in the validation process as that
     causes chain inconsistency"

* tag 'locking-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep: Avoid to modify chain keys in validate_chain()
  futex: Don't enable IRQs unconditionally in put_pi_state()

3 years agoMerge branch 'for-5.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis...
Linus Torvalds [Sun, 15 Nov 2020 16:57:19 +0000 (08:57 -0800)]
Merge branch 'for-5.10-fixes' of git://git./linux/kernel/git/dennis/percpu

Pull percpu fix and cleanup from Dennis Zhou:
 "A fix for a Wshadow warning in the asm-generic percpu macros came in
  and then I tacked on the removal of flexible array initializers in the
  percpu allocator"

* 'for-5.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
  percpu: convert flexible array initializers to use struct_size()
  asm-generic: percpu: avoid Wshadow warning

3 years agokvm: mmu: fix is_tdp_mmu_check when the TDP MMU is not in use
Paolo Bonzini [Sun, 15 Nov 2020 13:55:43 +0000 (08:55 -0500)]
kvm: mmu: fix is_tdp_mmu_check when the TDP MMU is not in use

In some cases where shadow paging is in use, the root page will
be either mmu->pae_root or vcpu->arch.mmu->lm_root.  Then it will
not have an associated struct kvm_mmu_page, because it is allocated
with alloc_page instead of kvm_mmu_alloc_page.

Just return false quickly from is_tdp_mmu_root if the TDP MMU is
not in use, which also includes the case where shadow paging is
enabled.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>