Baolin Wang [Sun, 17 May 2020 11:49:41 +0000 (19:49 +0800)]
block: Remove unused flush_queue_delayed in struct blk_flush_queue
The flush_queue_delayed was introdued to hold queue if flush is
running for non-queueable flush drive by commit
3ac0cc450870
("hold queue if flush is running for non-queueable flush drive"),
but the non mq parts of the flush code had been removed by
commit
7e992f847a08 ("block: remove non mq parts from the flush code"),
as well as removing the usage of the flush_queue_delayed flag.
Thus remove the unused flush_queue_delayed flag.
Signed-off-by: Baolin Wang <baolin.wang7@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Tue, 19 May 2020 04:07:37 +0000 (21:07 -0700)]
null_blk: Zero-initialize read buffers in non-memory-backed mode
This patch suppresses an uninteresting KMSAN complaint without affecting
performance of the null_blk driver if CONFIG_KMSAN is disabled.
Reported-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Alexander Potapenko <glider@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Tue, 19 May 2020 04:07:36 +0000 (21:07 -0700)]
block: Document the bio_vec properties
Since it is nontrivial that nth_page() does not have to be used for a
bio_vec, document this.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
CC: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Tue, 19 May 2020 04:07:35 +0000 (21:07 -0700)]
bio.h: Declare the arguments of the bio iteration functions const
This change makes it possible to pass 'const struct bio *' arguments to
these functions.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Tue, 19 May 2020 04:07:34 +0000 (21:07 -0700)]
block: Fix type of first compat_put_{,u}long() argument
This patch fixes the following sparse warnings:
block/ioctl.c:209:16: warning: incorrect type in argument 1 (different address spaces)
block/ioctl.c:209:16: expected void const volatile [noderef] <asn:1> *
block/ioctl.c:209:16: got signed int [usertype] *argp
block/ioctl.c:214:16: warning: incorrect type in argument 1 (different address spaces)
block/ioctl.c:214:16: expected void const volatile [noderef] <asn:1> *
block/ioctl.c:214:16: got unsigned int [usertype] *argp
block/ioctl.c:666:40: warning: incorrect type in argument 1 (different address spaces)
block/ioctl.c:666:40: expected signed int [usertype] *argp
block/ioctl.c:666:40: got void [noderef] <asn:1> *argp
block/ioctl.c:672:41: warning: incorrect type in argument 1 (different address spaces)
block/ioctl.c:672:41: expected unsigned int [usertype] *argp
block/ioctl.c:672:41: got void [noderef] <asn:1> *argp
Fixes:
9b81648cb5e3 ("compat_ioctl: simplify up block/ioctl.c")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 13 May 2020 10:49:35 +0000 (12:49 +0200)]
block: merge part_{inc,dev}_in_flight into their only callers
part_inc_in_flight and part_dec_in_flight only have one caller each, and
those callers are purely for bio based drivers. Merge each function into
the only caller, and remove the superflous blk-mq checks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 13 May 2020 10:49:34 +0000 (12:49 +0200)]
block: don't call part_{inc,dec}_in_flight for blk-mq devices
part_inc_in_flight and part_dec_in_flight are no-ops for blk-mq queues,
so remove the calls in purely blk-mq callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 13 May 2020 10:49:33 +0000 (12:49 +0200)]
block: move the blk-mq calls out of part_in_flight{,_rw}
Don't bother to call part_in_flight / part_in_flight_rw on blk-mq
devices, just call the blk-mq versions directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 13 May 2020 10:49:32 +0000 (12:49 +0200)]
block: mark blk_account_io_completion static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sat, 16 May 2020 18:28:01 +0000 (20:28 +0200)]
blk-mq: allow blk_mq_make_request to consume the q_usage_counter reference
blk_mq_make_request currently needs to grab an q_usage_counter
reference when allocating a request. This is because the block layer
grabs one before calling blk_mq_make_request, but also releases it as
soon as blk_mq_make_request returns. Remove the blk_queue_exit call
after blk_mq_make_request returns, and instead let it consume the
reference. This works perfectly fine for the block layer caller, just
device mapper needs an extra reference as the old problem still
persists there. Open code blk_queue_enter_live in device mapper,
as there should be no other callers and this allows better documenting
why we do a non-try get.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sat, 16 May 2020 18:28:00 +0000 (20:28 +0200)]
blk-mq: remove a pointless queue enter pair in blk_mq_alloc_request_hctx
No need for two queue references.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sat, 16 May 2020 18:27:59 +0000 (20:27 +0200)]
blk-mq: remove a pointless queue enter pair in blk_mq_alloc_request
No need for two queue references.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sat, 16 May 2020 18:27:58 +0000 (20:27 +0200)]
blk-mq: move the call to blk_queue_enter_live out of blk_mq_get_request
Move the blk_queue_enter_live calls into the callers, where they can
successively be cleaned up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Wed, 13 May 2020 16:02:23 +0000 (18:02 +0200)]
blktrace: Report pid with note messages
Currently informational messages within block trace do not have PID
information of the process reporting the message included. With BFQ it
is sometimes useful to have the information and there's no good reason
to omit the information from the trace. So just fill in pid information
when generating note message.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 4 May 2020 16:10:05 +0000 (18:10 +0200)]
block: remove the REQ_NOWAIT_INLINE flag
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Satya Tangirala [Thu, 14 May 2020 00:37:20 +0000 (00:37 +0000)]
block: blk-crypto-fallback for Inline Encryption
Blk-crypto delegates crypto operations to inline encryption hardware
when available. The separately configurable blk-crypto-fallback contains
a software fallback to the kernel crypto API - when enabled, blk-crypto
will use this fallback for en/decryption when inline encryption hardware
is not available.
This lets upper layers not have to worry about whether or not the
underlying device has support for inline encryption before deciding to
specify an encryption context for a bio. It also allows for testing
without actual inline encryption hardware - in particular, it makes it
possible to test the inline encryption code in ext4 and f2fs simply by
running xfstests with the inlinecrypt mount option, which in turn allows
for things like the regular upstream regression testing of ext4 to cover
the inline encryption code paths.
For more details, refer to Documentation/block/inline-encryption.rst.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Satya Tangirala [Thu, 14 May 2020 00:37:19 +0000 (00:37 +0000)]
block: Make blk-integrity preclude hardware inline encryption
Whenever a device supports blk-integrity, make the kernel pretend that
the device doesn't support inline encryption (essentially by setting the
keyslot manager in the request queue to NULL).
There's no hardware currently that supports both integrity and inline
encryption. However, it seems possible that there will be such hardware
in the near future (like the NVMe key per I/O support that might support
both inline encryption and PI).
But properly integrating both features is not trivial, and without
real hardware that implements both, it is difficult to tell if it will
be done correctly by the majority of hardware that support both.
So it seems best not to support both features together right now, and
to decide what to do at probe time.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Satya Tangirala [Thu, 14 May 2020 00:37:18 +0000 (00:37 +0000)]
block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manages encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.
We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.
We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.
Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.
Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Satya Tangirala [Thu, 14 May 2020 00:37:17 +0000 (00:37 +0000)]
block: Keyslot Manager for Inline Encryption
Inline Encryption hardware allows software to specify an encryption context
(an encryption key, crypto algorithm, data unit num, data unit size) along
with a data transfer request to a storage device, and the inline encryption
hardware will use that context to en/decrypt the data. The inline
encryption hardware is part of the storage device, and it conceptually sits
on the data path between system memory and the storage device.
Inline Encryption hardware implementations often function around the
concept of "keyslots". These implementations often have a limited number
of "keyslots", each of which can hold a key (we say that a key can be
"programmed" into a keyslot). Requests made to the storage device may have
a keyslot and a data unit number associated with them, and the inline
encryption hardware will en/decrypt the data in the requests using the key
programmed into that associated keyslot and the data unit number specified
with the request.
As keyslots are limited, and programming keys may be expensive in many
implementations, and multiple requests may use exactly the same encryption
contexts, we introduce a Keyslot Manager to efficiently manage keyslots.
We also introduce a blk_crypto_key, which will represent the key that's
programmed into keyslots managed by keyslot managers. The keyslot manager
also functions as the interface that upper layers will use to program keys
into inline encryption hardware. For more information on the Keyslot
Manager, refer to documentation found in block/keyslot-manager.c and
linux/keyslot-manager.h.
Co-developed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Satya Tangirala [Thu, 14 May 2020 00:37:16 +0000 (00:37 +0000)]
Documentation: Document the blk-crypto framework
The blk-crypto framework adds support for inline encryption. There are
numerous changes throughout the storage stack. This patch documents the
main design choices in the block layer, the API presented to users of
the block layer (like fscrypt or layered devices) and the API presented
to drivers for adding support for inline encryption.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Tue, 15 Oct 2019 00:18:11 +0000 (17:18 -0700)]
iocost: don't let vrate run wild while there's no saturation signal
When the QoS targets are met and nothing is being throttled, there's
no way to tell how saturated the underlying device is - it could be
almost entirely idle, at the cusp of saturation or anywhere inbetween.
Given that there's no information, it's best to keep vrate as-is in
this state. Before
7cd806a9a953 ("iocost: improve nr_lagging
handling"), this was the case - if the device isn't missing QoS
targets and nothing is being throttled, busy_level was reset to zero.
While fixing nr_lagging handling,
7cd806a9a953 ("iocost: improve
nr_lagging handling") broke this. Now, while the device is hitting
QoS targets and nothing is being throttled, vrate keeps getting
adjusted according to the existing busy_level.
This led to vrate keeping climing till it hits max when there's an IO
issuer with limited request concurrency if the vrate started low.
vrate starts getting adjusted upwards until the issuer can issue IOs
w/o being throttled. From then on, QoS targets keeps getting met and
nothing on the system needs throttling and vrate keeps getting
increased due to the existing busy_level.
This patch makes the following changes to the busy_level logic.
* Reset busy_level if nr_shortages is zero to avoid the above
scenario.
* Make non-zero nr_lagging block lowering nr_level but still clear
positive busy_level if there's clear non-saturation signal - QoS
targets are met and nr_shortages is non-zero. nr_lagging's role is
preventing adjusting vrate upwards while there are long-running
commands and it shouldn't keep busy_level positive while there's
clear non-saturation signal.
* Restructure code for clarity and add comments.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andy Newell <newella@fb.com>
Fixes:
7cd806a9a953 ("iocost: improve nr_lagging handling")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Thu, 14 May 2020 08:45:09 +0000 (16:45 +0800)]
block: move blk_io_schedule() out of header file
blk_io_schedule() isn't called from performance sensitive code path, and
it is easier to maintain by exporting it as symbol.
Also blk_io_schedule() is only called by CONFIG_BLOCK code, so it is safe
to do this way. Meantime fixes build failure when CONFIG_BLOCK is off.
Cc: Christoph Hellwig <hch@infradead.org>
Fixes:
e6249cdd46e4 ("block: add blk_io_schedule() for avoiding task hung in sync dio")
Reported-by: Satya Tangirala <satyat@google.com>
Tested-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Johannes Thumshirn [Tue, 12 May 2020 08:55:54 +0000 (17:55 +0900)]
zonefs: use REQ_OP_ZONE_APPEND for sync DIO
Synchronous direct I/O to a sequential write only zone can be issued using
the new REQ_OP_ZONE_APPEND request operation. As dispatching multiple
BIOs can potentially result in reordering, we cannot support asynchronous
IO via this interface.
We also can only dispatch up to queue_max_zone_append_sectors() via the
new zone-append method and have to return a short write back to user-space
in case an IO larger than queue_max_zone_append_sectors() has been issued.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Johannes Thumshirn [Tue, 12 May 2020 08:55:53 +0000 (17:55 +0900)]
block: export bio_release_pages and bio_iov_iter_get_pages
Export bio_release_pages and bio_iov_iter_get_pages, so they can be used
from modular code.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Tue, 12 May 2020 08:55:52 +0000 (17:55 +0900)]
null_blk: Support REQ_OP_ZONE_APPEND
Support REQ_OP_ZONE_APPEND requests for null_blk devices with zoned
mode enabled. Use the internally tracked zone write pointer position
as the actual write position and return it using the command request
__sector field in the case of an mq device and using the command BIO
sector in the case of a BIO device.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Johannes Thumshirn [Tue, 12 May 2020 08:55:51 +0000 (17:55 +0900)]
scsi: sd_zbc: emulate ZONE_APPEND commands
Emulate ZONE_APPEND for SCSI disks using a regular WRITE(16) command
with a start LBA set to the target zone write pointer position.
In order to always know the write pointer position of a sequential write
zone, the write pointer of all zones is tracked using an array of 32bits
zone write pointer offset attached to the scsi disk structure. Each
entry of the array indicate a zone write pointer position relative to
the zone start sector. The write pointer offsets are maintained in sync
with the device as follows:
1) the write pointer offset of a zone is reset to 0 when a
REQ_OP_ZONE_RESET command completes.
2) the write pointer offset of a zone is set to the zone size when a
REQ_OP_ZONE_FINISH command completes.
3) the write pointer offset of a zone is incremented by the number of
512B sectors written when a write, write same or a zone append
command completes.
4) the write pointer offset of all zones is reset to 0 when a
REQ_OP_ZONE_RESET_ALL command completes.
Since the block layer does not write lock zones for zone append
commands, to ensure a sequential ordering of the regular write commands
used for the emulation, the target zone of a zone append command is
locked when the function sd_zbc_prepare_zone_append() is called from
sd_setup_read_write_cmnd(). If the zone write lock cannot be obtained
(e.g. a zone append is in-flight or a regular write has already locked
the zone), the zone append command dispatching is delayed by returning
BLK_STS_ZONE_RESOURCE.
To avoid the need for write locking all zones for REQ_OP_ZONE_RESET_ALL
requests, use a spinlock to protect accesses and modifications of the
zone write pointer offsets. This spinlock is initialized from sd_probe()
using the new function sd_zbc_init().
Co-developed-by: Damien Le Moal <Damien.LeMoal@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Johannes Thumshirn [Tue, 12 May 2020 08:55:50 +0000 (17:55 +0900)]
scsi: sd_zbc: factor out sanity checks for zoned commands
Factor sanity checks for zoned commands from sd_zbc_setup_zone_mgmt_cmnd().
This will help with the introduction of an emulated ZONE_APPEND command.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Tue, 12 May 2020 08:55:49 +0000 (17:55 +0900)]
block: Modify revalidate zones
Modify the interface of blk_revalidate_disk_zones() to add an optional
driver callback function that a driver can use to extend processing
done during zone revalidation. The callback, if defined, is executed
with the device request queue frozen, after all zones have been
inspected.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Johannes Thumshirn [Tue, 12 May 2020 08:55:48 +0000 (17:55 +0900)]
block: introduce blk_req_zone_write_trylock
Introduce blk_req_zone_write_trylock(), which either grabs the write-lock
for a sequential zone or returns false, if the zone is already locked.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Tue, 12 May 2020 08:55:47 +0000 (17:55 +0900)]
block: Introduce REQ_OP_ZONE_APPEND
Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned
block device. This is a no-merge write operation.
A zone append write BIO must:
* Target a zoned block device
* Have a sector position indicating the start sector of the target zone
* The target zone must be a sequential write zone
* The BIO must not cross a zone boundary
* The BIO size must not be split to ensure that a single range of LBAs
is written with a single command.
Implement these checks in generic_make_request_checks() using the
helper function blk_check_zone_append(). To avoid write append BIO
splitting, introduce the new max_zone_append_sectors queue limit
attribute and ensure that a BIO size is always lower than this limit.
Export this new limit through sysfs and check these limits in bio_full().
Also when a LLDD can't dispatch a request to a specific zone, it
will return BLK_STS_ZONE_RESOURCE indicating this request needs to
be delayed, e.g. because the zone it will be dispatched to is still
write-locked. If this happens set the request aside in a local list
to continue trying dispatching requests such as READ requests or a
WRITE/ZONE_APPEND requests targetting other zones. This way we can
still keep a high queue depth without starving other requests even if
one request can't be served due to zone write-locking.
Finally, make sure that the bio sector position indicates the actual
write position as indicated by the device on completion.
Signed-off-by: Keith Busch <kbusch@kernel.org>
[ jth: added zone-append specific add_page and merge_page helpers ]
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 12 May 2020 08:55:46 +0000 (17:55 +0900)]
block: rename __bio_add_pc_page to bio_add_hw_page
Rename __bio_add_pc_page() to bio_add_hw_page() and explicitly pass in a
max_sectors argument.
This max_sectors argument can be used to specify constraints from the
hardware.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[ jth: rebased and made public for blk-map.c ]
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Johannes Thumshirn [Tue, 12 May 2020 08:55:45 +0000 (17:55 +0900)]
block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no
blk_queue_zone_is_seq() and blk_queue_zone_no() have not been called with
CONFIG_BLK_DEV_ZONED disabled until now.
The introduction of REQ_OP_ZONE_APPEND will change this, so we need to
provide noop fallbacks for the !CONFIG_BLK_DEV_ZONED case.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sun, 3 May 2020 01:54:22 +0000 (09:54 +0800)]
block: add blk_io_schedule() for avoiding task hung in sync dio
Sync dio could be big, or may take long time in discard or in case of
IO failure.
We have prevented task hung in submit_bio_wait() and blk_execute_rq(),
so apply the same trick for prevent task hung from happening in sync dio.
Add helper of blk_io_schedule() and use io_schedule_timeout() to prevent
task hung warning.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Cc: Salman Qazi <sqazi@google.com>
Cc: Jesse Barnes <jsbarnes@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 8 May 2020 08:17:58 +0000 (16:17 +0800)]
block: don't hold part0's refcount in IO path
gendisk can't be gone when there is IO activity, so not hold
part0's refcount in IO path.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Cc: Yufen Yu <yuyufen@huawei.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 8 May 2020 08:17:57 +0000 (16:17 +0800)]
block: re-organize fields of 'struct hd_part'
Put all fields accessed in IO path together at the beginning
of the struct, so that all can be fetched in single cacheline.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Cc: Yufen Yu <yuyufen@huawei.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 8 May 2020 08:17:56 +0000 (16:17 +0800)]
block: only define 'nr_sects_seq' in hd_part for 32bit SMP
The seqcount of 'nr_sects_seq' is only needed in case of 32bit SMP,
so define it just for 32bit SMP.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Cc: Yufen Yu <yuyufen@huawei.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 8 May 2020 08:17:55 +0000 (16:17 +0800)]
block: fix use-after-free on cached last_lookup partition
delete_partition() clears the cached last_lookup partition. However the
.last_lookup cache may be overwritten by one IO path after it is cleared
from delete_partition(). Then another IO path may use the cached deleting
partition after hd_struct_free() is called, then use-after-free is triggered
on the cached partition.
Fixes the issue by the following approach:
1) always get the partition's refcount via hd_struct_try_get() before
setting .last_lookup
2) move clearing .last_lookup from delete_partition() to hd_struct_free()
which is the release handle of the partition's percpu-refcount, so that no
IO path can cache deleteing partition via .last_lookup.
It is one candidate approach of Yufen's patch[1] which adds overhead
in fast path by indirect lookup which may introduce one extra cacheline
in IO path. Also this patch relies on percpu-refcount's protection, and
it is easier to understand and verify.
[1] https://lore.kernel.org/linux-block/
20200109013551.GB9655@ming.t460p/T/#t
Reported-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Weiping Zhang [Wed, 13 May 2020 00:44:05 +0000 (08:44 +0800)]
block: reset mapping if failed to update hardware queue count
When we increase hardware queue count, blk_mq_update_queue_map will
reset the mapping between cpu and hardware queue base on the hardware
queue count(set->nr_hw_queues). The mapping cannot be reset if it
encounters error in blk_mq_realloc_hw_ctxs, but the fallback flow will
continue using it, then blk_mq_map_swqueue will touch a invalid memory,
because the mapping points to a wrong hctx.
blktest block/030:
null_blk: module loaded
Increasing nr_hw_queues to 8 fails, fallback to 1
==================================================================
BUG: KASAN: null-ptr-deref in blk_mq_map_swqueue+0x2f2/0x830
Read of size 8 at addr
0000000000000128 by task nproc/8541
CPU: 5 PID: 8541 Comm: nproc Not tainted 5.7.0-rc4-dbg+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
Call Trace:
dump_stack+0xa5/0xe6
__kasan_report.cold+0x65/0xbb
kasan_report+0x45/0x60
check_memory_region+0x15e/0x1c0
__kasan_check_read+0x15/0x20
blk_mq_map_swqueue+0x2f2/0x830
__blk_mq_update_nr_hw_queues+0x3df/0x690
blk_mq_update_nr_hw_queues+0x32/0x50
nullb_device_submit_queues_store+0xde/0x160 [null_blk]
configfs_write_file+0x1c4/0x250 [configfs]
__vfs_write+0x4c/0x90
vfs_write+0x14b/0x2d0
ksys_write+0xdd/0x180
__x64_sys_write+0x47/0x50
do_syscall_64+0x6f/0x310
entry_SYSCALL_64_after_hwframe+0x49/0xb3
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Tested-by: Bart van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stephen Rothwell [Mon, 11 May 2020 04:19:30 +0000 (14:19 +1000)]
bdi: fix up for "remove the name field in struct backing_dev_info"
Fixes:
1cd925d58385 ("bdi: remove the name field in struct backing_dev_info")
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 8 May 2020 08:18:28 +0000 (10:18 +0200)]
hfs: stop using ioctl_by_bdev
Instead just call the CDROM layer functionality directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 4 May 2020 12:48:01 +0000 (14:48 +0200)]
bdi: remove the name field in struct backing_dev_info
The name is only printed for a not registered bdi in writeback. Use the
device name there as is more useful anyway for the unlike case that the
warning triggers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 4 May 2020 12:48:00 +0000 (14:48 +0200)]
bdi: simplify bdi_alloc
Merge the _node vs normal version and drop the superflous gfp_t argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 4 May 2020 12:47:59 +0000 (14:47 +0200)]
bdi: remove bdi_register_owner
Split out a new bdi_set_owner helper to set the owner, and move the policy
for creating the bdi name back into genhd.c, where it belongs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 4 May 2020 12:47:58 +0000 (14:47 +0200)]
bdi: unexport bdi_register_va
bdi_register_va is only used by super.c, which can't be modular.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 4 May 2020 12:47:57 +0000 (14:47 +0200)]
driver core: remove device_create_vargs
All external users of device_create_vargs are gone, so remove it and
open code it in the only caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Weiping Zhang [Thu, 7 May 2020 13:04:42 +0000 (21:04 +0800)]
block: rename blk_mq_alloc_rq_maps
rename blk_mq_alloc_rq_maps to blk_mq_alloc_map_and_requests,
this function allocs both map and request, make function name align
with funtion.
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Weiping Zhang [Thu, 7 May 2020 13:04:22 +0000 (21:04 +0800)]
block: rename __blk_mq_alloc_rq_map
rename __blk_mq_alloc_rq_map to __blk_mq_alloc_map_and_request,
actually it alloc both map and request, make function name
align with function.
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Thu, 7 May 2020 13:04:08 +0000 (21:04 +0800)]
block: alloc map and request for new hardware queue
Alloc new map and request for new hardware queue when increse
hardware queue count. Before this patch, it will show a
warning for each new hardware queue, but it's not enough, these
hctx have no maps and reqeust, when a bio was mapped to these
hardware queue, it will trigger kernel panic when get request
from these hctx.
Test environment:
* A NVMe disk supports 128 io queues
* 96 cpus in system
A corner case can always trigger this panic, there are 96
io queues allocated for HCTX_TYPE_DEFAULT type, the corresponding kernel
log: nvme nvme0: 96/0/0 default/read/poll queues. Now we set nvme write
queues to 96, then nvme will alloc others(32) queues for read, but
blk_mq_update_nr_hw_queues does not alloc map and request for these new
added io queues. So when process read nvme disk, it will trigger kernel
panic when get request from these hardware context.
Reproduce script:
nr=$(expr `cat /sys/block/nvme0n1/device/queue_count` - 1)
echo $nr > /sys/module/nvme/parameters/write_queues
echo 1 > /sys/block/nvme0n1/device/reset_controller
dd if=/dev/nvme0n1 of=/dev/null bs=4K count=1
[ 8040.805626] ------------[ cut here ]------------
[ 8040.805627] WARNING: CPU: 82 PID: 12921 at block/blk-mq.c:2578 blk_mq_map_swqueue+0x2b6/0x2c0
[ 8040.805627] Modules linked in: nvme nvme_core nf_conntrack_netlink xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nf_conntrack_tftp nft_masq nf_tables_set nft_fib_inet nft_f
ib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack tun bridge nf_defrag_ipv6 nf_defrag_ipv4 stp llc ip6_tables ip_tables nft_compat rfkill ip_set nf_tables nfne
tlink sunrpc intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel intel_
cstate intel_uncore raid0 joydev intel_rapl_perf ipmi_si pcspkr mei_me ioatdma sg ipmi_devintf mei i2c_i801 dca lpc_ich ipmi_msghandler acpi_power_meter acpi_pad xfs libcrc32c sd_mod ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm d
rm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
[ 8040.805637] ahci drm i40e libahci crc32c_intel libata t10_pi wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nvme_core]
[ 8040.805640] CPU: 82 PID: 12921 Comm: kworker/u194:2 Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2
[ 8040.805640] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
[ 8040.805641] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[ 8040.805642] RIP: 0010:blk_mq_map_swqueue+0x2b6/0x2c0
[ 8040.805643] Code: 00 00 00 00 00 41 83 c5 01 44 39 6d 50 77 b8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b bb 98 00 00 00 89 d6 e8 8c 81 03 00 eb 83 <0f> 0b e9 52 ff ff ff 0f 1f 00 0f 1f 44 00 00 41 57 48 89 f1 41 56
[ 8040.805643] RSP: 0018:
ffffba590d2e7d48 EFLAGS:
00010246
[ 8040.805643] RAX:
0000000000000000 RBX:
ffff9f013e1ba800 RCX:
000000000000003d
[ 8040.805644] RDX:
ffff9f00ffff6000 RSI:
0000000000000003 RDI:
ffff9ed200246d90
[ 8040.805644] RBP:
ffff9f00f6a79860 R08:
0000000000000000 R09:
000000000000003d
[ 8040.805645] R10:
0000000000000001 R11:
ffff9f0138c3d000 R12:
ffff9f00fb3a9008
[ 8040.805645] R13:
000000000000007f R14:
ffffffff96822660 R15:
000000000000005f
[ 8040.805645] FS:
0000000000000000(0000) GS:
ffff9f013fa80000(0000) knlGS:
0000000000000000
[ 8040.805646] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 8040.805646] CR2:
00007f7f397fa6f8 CR3:
0000003d8240a002 CR4:
00000000007606e0
[ 8040.805647] DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
[ 8040.805647] DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
[ 8040.805647] PKRU:
55555554
[ 8040.805647] Call Trace:
[ 8040.805649] blk_mq_update_nr_hw_queues+0x31b/0x390
[ 8040.805650] nvme_reset_work+0xb4b/0xeab [nvme]
[ 8040.805651] process_one_work+0x1a7/0x370
[ 8040.805652] worker_thread+0x1c9/0x380
[ 8040.805653] ? max_active_store+0x80/0x80
[ 8040.805655] kthread+0x112/0x130
[ 8040.805656] ? __kthread_parkme+0x70/0x70
[ 8040.805657] ret_from_fork+0x35/0x40
[ 8040.805658] ---[ end trace
b5f13b1e73ccb5d3 ]---
[ 8229.365135] BUG: kernel NULL pointer dereference, address:
0000000000000004
[ 8229.365165] #PF: supervisor read access in kernel mode
[ 8229.365178] #PF: error_code(0x0000) - not-present page
[ 8229.365191] PGD 0 P4D 0
[ 8229.365201] Oops: 0000 [#1] SMP PTI
[ 8229.365212] CPU: 77 PID: 13024 Comm: dd Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2
[ 8229.365232] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
[ 8229.365253] RIP: 0010:blk_mq_get_tag+0x227/0x250
[ 8229.365265] Code: 44 24 04 44 01 e0 48 8b 74 24 38 65 48 33 34 25 28 00 00 00 75 33 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e c3 48 8d 68 10 4c 89 ef <44> 8b 60 04 48 89 ee e8 dd f9 ff ff 83 f8 ff 75 c8 e9 67 fe ff ff
[ 8229.365304] RSP: 0018:
ffffba590e977970 EFLAGS:
00010246
[ 8229.365317] RAX:
0000000000000000 RBX:
ffff9f00f6a79860 RCX:
ffffba590e977998
[ 8229.365333] RDX:
0000000000000000 RSI:
ffff9f012039b140 RDI:
ffffba590e977a38
[ 8229.365349] RBP:
0000000000000010 R08:
ffffda58ff94e190 R09:
ffffda58ff94e198
[ 8229.365365] R10:
0000000000000011 R11:
ffff9f00f6a79860 R12:
0000000000000000
[ 8229.365381] R13:
ffffba590e977a38 R14:
ffff9f012039b140 R15:
0000000000000001
[ 8229.365397] FS:
00007f481c230580(0000) GS:
ffff9f013f940000(0000) knlGS:
0000000000000000
[ 8229.365415] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 8229.365428] CR2:
0000000000000004 CR3:
0000005f35e26004 CR4:
00000000007606e0
[ 8229.365444] DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
[ 8229.365460] DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
[ 8229.365476] PKRU:
55555554
[ 8229.365484] Call Trace:
[ 8229.365498] ? finish_wait+0x80/0x80
[ 8229.365512] blk_mq_get_request+0xcb/0x3f0
[ 8229.365525] blk_mq_make_request+0x143/0x5d0
[ 8229.365538] generic_make_request+0xcf/0x310
[ 8229.365553] ? scan_shadow_nodes+0x30/0x30
[ 8229.365564] submit_bio+0x3c/0x150
[ 8229.365576] mpage_readpages+0x163/0x1a0
[ 8229.365588] ? blkdev_direct_IO+0x490/0x490
[ 8229.365601] read_pages+0x6b/0x190
[ 8229.365612] __do_page_cache_readahead+0x1c1/0x1e0
[ 8229.365626] ondemand_readahead+0x182/0x2f0
[ 8229.365639] generic_file_buffered_read+0x590/0xab0
[ 8229.365655] new_sync_read+0x12a/0x1c0
[ 8229.365666] vfs_read+0x8a/0x140
[ 8229.365676] ksys_read+0x59/0xd0
[ 8229.365688] do_syscall_64+0x55/0x1d0
[ 8229.365700] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Tested-by: Weiping Zhang <zhangweiping@didiglobal.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Weiping Zhang [Thu, 7 May 2020 13:03:56 +0000 (21:03 +0800)]
block: save previous hardware queue count before udpate
blk_mq_realloc_tag_set_tags will update set->nr_hw_queues, so
save old set->nr_hw_queues before call this function.
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Weiping Zhang [Thu, 7 May 2020 13:03:39 +0000 (21:03 +0800)]
block: free both rq_map and request
Allocation:
__blk_mq_alloc_rq_map
blk_mq_alloc_rq_map
blk_mq_alloc_rq_map
tags = blk_mq_init_tags : kzalloc_node:
tags->rqs = kcalloc_node
tags->static_rqs = kcalloc_node
blk_mq_alloc_rqs
p = alloc_pages_node
tags->static_rqs[i] = p + offset;
Free:
blk_mq_free_rq_map
kfree(tags->rqs);
kfree(tags->static_rqs);
blk_mq_free_tags
kfree(tags);
The page allocated in blk_mq_alloc_rqs cannot be released,
so we should use blk_mq_free_map_and_requests here.
blk_mq_free_map_and_requests
blk_mq_free_rqs
__free_pages : cleanup for blk_mq_alloc_rqs
blk_mq_free_rq_map : cleanup for blk_mq_alloc_rq_map
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 9 May 2020 22:13:58 +0000 (16:13 -0600)]
Merge branch 'block-5.7' into for-5.8/block
Pull in block-5.7 fixes for 5.8. Mostly to resolve a conflict with
the blk-iocost changes, but we also need the base of the bdi
use-after-free as well as we build on top of it.
* block-5.7:
nvme: fix possible hang when ns scanning fails during error recovery
nvme-pci: fix "slimmer CQ head update"
bdi: add a ->dev_name field to struct backing_dev_info
bdi: use bdi_dev_name() to get device name
bdi: move bdi_dev_name out of line
vboxsf: don't use the source name in the bdi name
iocost: protect iocg->abs_vdebt with iocg->waitq.lock
block: remove the bd_openers checks in blk_drop_partitions
nvme: prevent double free in nvme_alloc_ns() error handling
null_blk: Cleanup zoned device initialization
null_blk: Fix zoned command handling
block: remove unused header
blk-iocost: Fix error on iocost_ioc_vrate_adj
bdev: Reduce time holding bd_mutex in sync in blkdev_close()
buffer: remove useless comment and WB_REASON_FREE_MORE_MEM, reason.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Sagi Grimberg [Wed, 6 May 2020 22:44:02 +0000 (15:44 -0700)]
nvme: fix possible hang when ns scanning fails during error recovery
When the controller is reconnecting, the host fails I/O and admin
commands as the host cannot reach the controller. ns scanning may
revalidate namespaces during that period and it is wrong to remove
namespaces due to these failures as we may hang (see
205da2434301).
One command that may fail is nvme_identify_ns_descs. Since we return
success due to having ns identify descriptor list optional, we continue
to compare ns identifiers in nvme_revalidate_disk, obviously fail and
return -ENODEV to nvme_validate_ns, which will remove the namespace.
Exactly what we don't want to happen.
Fixes:
22802bf742c2 ("nvme: Namepace identification descriptor list is optional")
Tested-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Alexey Dobriyan [Thu, 7 May 2020 20:07:04 +0000 (23:07 +0300)]
nvme-pci: fix "slimmer CQ head update"
Pre-incrementing ->cq_head can't be done in memory because OOB value
can be observed by another context.
This devalues space savings compared to original code :-\
$ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux
add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-32 (-32)
Function old new delta
nvme_poll_irqdisable 464 456 -8
nvme_poll 455 447 -8
nvme_irq 388 380 -8
nvme_dev_disable 955 947 -8
But the code is minimal now: one read for head, one read for q_depth,
one increment, one comparison, single instruction phase bit update and
one write for new head.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reported-by: John Garry <john.garry@huawei.com>
Tested-by: John Garry <john.garry@huawei.com>
Fixes:
e2a366a4b0feaeb ("nvme-pci: slimmer CQ head update")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 4 May 2020 12:47:56 +0000 (14:47 +0200)]
bdi: add a ->dev_name field to struct backing_dev_info
Cache a copy of the name for the life time of the backing_dev_info
structure so that we can reference it even after unregistering.
Fixes:
68f23b89067f ("memcg: fix a crash in wb_workfn when a device disappears")
Reported-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yufen Yu [Mon, 4 May 2020 12:47:55 +0000 (14:47 +0200)]
bdi: use bdi_dev_name() to get device name
Use the common interface bdi_dev_name() to get device name.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Add missing <linux/backing-dev.h> include BFQ
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 4 May 2020 12:47:54 +0000 (14:47 +0200)]
bdi: move bdi_dev_name out of line
bdi_dev_name is not a fast path function, move it out of line. This
prepares for using it from modular callers without having to export
an implementation detail like bdi_unknown_name.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 4 May 2020 12:47:53 +0000 (14:47 +0200)]
vboxsf: don't use the source name in the bdi name
Simplify the bdi name to mirror what we are doing elsewhere, and
drop them name in favor of just using a number. This avoids a
potentially very long bdi name.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Mon, 4 May 2020 23:27:54 +0000 (19:27 -0400)]
iocost: protect iocg->abs_vdebt with iocg->waitq.lock
abs_vdebt is an atomic_64 which tracks how much over budget a given cgroup
is and controls the activation of use_delay mechanism. Once a cgroup goes
over budget from forced IOs, it has to pay it back with its future budget.
The progress guarantee on debt paying comes from the iocg being active -
active iocgs are processed by the periodic timer, which ensures that as time
passes the debts dissipate and the iocg returns to normal operation.
However, both iocg activation and vdebt handling are asynchronous and a
sequence like the following may happen.
1. The iocg is in the process of being deactivated by the periodic timer.
2. A bio enters ioc_rqos_throttle(), calls iocg_activate() which returns
without anything because it still sees that the iocg is already active.
3. The iocg is deactivated.
4. The bio from #2 is over budget but needs to be forced. It increases
abs_vdebt and goes over the threshold and enables use_delay.
5. IO control is enabled for the iocg's subtree and now IOs are attributed
to the descendant cgroups and the iocg itself no longer issues IOs.
This leaves the iocg with stuck abs_vdebt - it has debt but inactive and no
further IOs which can activate it. This can end up unduly punishing all the
descendants cgroups.
The usual throttling path has the same issue - the iocg must be active while
throttled to ensure that future event will wake it up - and solves the
problem by synchronizing the throttling path with a spinlock. abs_vdebt
handling is another form of overage handling and shares a lot of
characteristics including the fact that it isn't in the hottest path.
This patch fixes the above and other possible races by strictly
synchronizing abs_vdebt and use_delay handling with iocg->waitq.lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Vlad Dmitriev <vvd@fb.com>
Cc: stable@vger.kernel.org # v5.4+
Fixes:
e1518f63f246 ("blk-iocost: Don't let merges push vtime into the future")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sat, 25 Apr 2020 07:57:06 +0000 (09:57 +0200)]
udf: stop using ioctl_by_bdev
Instead just call the CDROM layer functionality directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sat, 25 Apr 2020 07:57:05 +0000 (09:57 +0200)]
isofs: stop using ioctl_by_bdev
Instead just call the CDROM layer functionality directly, and turn the
hot mess in isofs_get_last_session into remotely readable code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sat, 25 Apr 2020 07:57:04 +0000 (09:57 +0200)]
hfsplus: stop using ioctl_by_bdev
Instead just call the CDROM layer functionality directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sat, 25 Apr 2020 07:57:03 +0000 (09:57 +0200)]
cdrom: factor out a cdrom_multisession helper
Factor out a version of the CDROMMULTISESSION ioctl handler that can
be called directly from kernel space.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sat, 25 Apr 2020 07:57:02 +0000 (09:57 +0200)]
cdrom: factor out a cdrom_read_tocentry helper
Factor out a version of the CDROMREADTOCENTRY ioctl handler that can
be called directly from kernel space.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sat, 25 Apr 2020 07:57:01 +0000 (09:57 +0200)]
ide-cd: rename cdrom_read_tocentry
Give the cdrom_read_tocentry function and ide_ prefix to not conflict
with the soon to be added generic function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sat, 25 Apr 2020 07:57:00 +0000 (09:57 +0200)]
block: add a cdrom_device_info pointer to struct gendisk
Add a pointer to the CDROM information structure to struct gendisk.
This will allow various removable media file systems to call directly
into the CDROM layer instead of abusing ioctls with kernel pointers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Mon, 13 Apr 2020 16:27:58 +0000 (12:27 -0400)]
iocost_monitor: drop string wrap around numbers when outputting json
Wrapping numbers in strings is used by some to work around bit-width issues in
some enviroments. The problem isn't innate to json and the workaround seems to
cause more integration problems than help. Let's drop the string wrapping.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Mon, 13 Apr 2020 16:27:57 +0000 (12:27 -0400)]
iocost_monitor: exit successfully if interval is zero
This is to help external tools to decide whether iocost_monitor has all its
requirements met or not based on the exit status of an -i0 run.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Mon, 13 Apr 2020 16:27:56 +0000 (12:27 -0400)]
blk-iocost: account for IO size when testing latencies
On each IO completion, iocost decides whether the IO met or missed its latency
target. Currently, the targets are fixed numbers per IO type. While this can be
good enough for loose latency targets way higher than typical completion
latencies, the effect of IO size makes it difficult to tighten the latency
target - a target adequate for 4k IOs might be too tight for 512k IOs and
vice-versa.
iocost already has all the necessary information to account for different IO
sizes when testing whether the latency target is met as iocost can calculate the
size vtime cost of a given IO. This patch updates the completion path to
calculate the size vtime cost of the IO, deduct the nsec equivalent from the
observed latency and use the adjusted value to decide whether the target is met.
This makes latency targets independent from IO size and enables determining
adequate latency targets with fixed size fio runs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Mon, 13 Apr 2020 16:27:55 +0000 (12:27 -0400)]
blk-iocost: switch to fixed non-auto-decaying use_delay
The use_delay mechanism was introduced by blk-iolatency to hold memory
allocators accountable for the reclaim and other shared IOs they cause. The
duration of the delay is dynamically balanced between iolatency increasing the
value on each target miss and it auto-decaying as time passes and threads get
delayed on it.
While this works well for iolatency, iocost's control model isn't compatible
with it. There is no repeated "violation" events which can be balanced against
auto-decaying. iocost instead knows how much a given cgroup is over budget and
wants to prevent that cgroup from issuing IOs while over budget. Until now,
iocost has been adding the cost of force-issued IOs. However, this doesn't
reflect the amount which is already over budget and is simply not enough to
counter the auto-decaying allowing anon-memory leaking low priority cgroup to
go over its alloted share of IOs.
As auto-decaying doesn't make much sense for iocost, this patch introduces a
different mode of operation for use_delay - when blkcg_set_delay() are used
insted of blkcg_add/use_delay(), the delay duration is not auto-decayed until it
is explicitly cleared with blkcg_clear_delay(). iocost is updated to keep the
delay duration synchronized to the budget overage amount.
With this change, iocost can effectively police cgroups which generate
significant amount of force-issued IOs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 28 Apr 2020 08:52:03 +0000 (10:52 +0200)]
block: remove the bd_openers checks in blk_drop_partitions
When replacing the bd_super check with a bd_openers I followed a logical
conclusion, which turns out to be utterly wrong. When a block device has
bd_super sets it has a mount file system on it (although not every
mounted file system sets bd_super), but that also implies it doesn't even
have partitions to start with.
So instead of trying to come up with a logical check for all openers,
just remove the check entirely.
Fixes:
d3ef5536274f ("block: fix busy device checking in blk_drop_partitions")
Fixes:
cb6b771b05c3 ("block: fix busy device checking in blk_drop_partitions again")
Reported-by: Michal Koutný <mkoutny@suse.com>
Reported-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 30 Apr 2020 15:32:48 +0000 (09:32 -0600)]
Merge branch 'nvme-5.7' of git://git.infradead.org/nvme into block-5.7
Pull NVMe fix from Christoph.
* 'nvme-5.7' of git://git.infradead.org/nvme:
nvme: prevent double free in nvme_alloc_ns() error handling
Christoph Hellwig [Tue, 28 Apr 2020 11:27:56 +0000 (13:27 +0200)]
block: add a bio_queue_enter helper
Add a little helper that passes the right nowait flag to blk_queue_enter
based on the bio flag, and terminates the bio with the right error code
if entering the queue fails.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 28 Apr 2020 11:27:55 +0000 (13:27 +0200)]
block: replace BIO_QUEUE_ENTERED with BIO_CGROUP_ACCT
BIO_QUEUE_ENTERED is only used for cgroup accounting now, so rename
the flag and move setting it into the cgroup code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 28 Apr 2020 11:27:54 +0000 (13:27 +0200)]
block: cleanup the memory stall accounting in submit_bio
Instead of a convoluted chain just check for REQ_OP_READ directly,
and keep all the memory stall code together in a single unlikely
branch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 28 Apr 2020 11:27:53 +0000 (13:27 +0200)]
block: improve the submit_bio and generic_make_request documentation
The current documentation is a little weird, as it doesn't clearly
explain which function to use, and also has the guts of the information
on generic_make_request, which is the internal interface for stacking
drivers.
Fix this up by properly documenting submit_bio, and only documenting
the differences and the use case for generic_make_request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Zheng Bin [Wed, 29 Apr 2020 01:36:32 +0000 (09:36 +0800)]
blk-mq: make function '__blk_mq_sched_dispatch_requests' static
Fix sparse warnings:
block/blk-mq-sched.c:209:5: warning: symbol '__blk_mq_sched_dispatch_requests' was not declared. Should it be static?
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Niklas Cassel [Mon, 27 Apr 2020 12:34:41 +0000 (14:34 +0200)]
nvme: prevent double free in nvme_alloc_ns() error handling
When jumping to the out_put_disk label, we will call put_disk(), which will
trigger a call to disk_release(), which calls blk_put_queue().
Later in the cleanup code, we do blk_cleanup_queue(), which will also call
blk_put_queue().
Putting the queue twice is incorrect, and will generate a KASAN splat.
Set the disk->queue pointer to NULL, before calling put_disk(), so that the
first call to blk_put_queue() will not free the queue.
The second call to blk_put_queue() uses another pointer to the same queue,
so this call will still free the queue.
Fixes:
85136c010285 ("lightnvm: simplify geometry enumeration")
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Sat, 25 Apr 2020 07:53:36 +0000 (09:53 +0200)]
block: bypass ->make_request_fn for blk-mq drivers
Call blk_mq_make_request when no ->make_request_fn is set. This is
safe now that blk_alloc_queue always sets up the pointer for make_request
based drivers. This avoids an indirect call in the blk-mq driver I/O
fast path, which is rather expensive due to spectre mitigations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sat, 25 Apr 2020 07:53:35 +0000 (09:53 +0200)]
dm: remove the make_request_fn check in device_area_is_invalid
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sat, 25 Apr 2020 07:53:34 +0000 (09:53 +0200)]
bcache: remove a duplicate ->make_request_fn assignment
The make_request_fn pointer should only be assigned by blk_alloc_queue.
Fix a left over manual initialization.
Fixes:
ff27668ce809 ("bcache: pass the make_request methods to blk_queue_make_request")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sat, 25 Apr 2020 07:55:51 +0000 (09:55 +0200)]
block: remove create_io_context
create_io_context just has a single caller, which also happens to not
even use the return value. Just open code it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Salman Qazi [Fri, 24 Apr 2020 15:03:21 +0000 (08:03 -0700)]
block: Limit number of items taken from the I/O scheduler in one go
Flushes bypass the I/O scheduler and get added to hctx->dispatch
in blk_mq_sched_bypass_insert. This can happen while a kworker is running
hctx->run_work work item and is past the point in
blk_mq_sched_dispatch_requests where hctx->dispatch is checked.
The blk_mq_do_dispatch_sched call is not guaranteed to end in bounded time,
because the I/O scheduler can feed an arbitrary number of commands.
Since we have only one hctx->run_work, the commands waiting in
hctx->dispatch will wait an arbitrary length of time for run_work to be
rerun.
A similar phenomenon exists with dispatches from the software queue.
The solution is to poll hctx->dispatch in blk_mq_do_dispatch_sched and
blk_mq_do_dispatch_ctx and return from the run_work handler and let it
rerun.
Signed-off-by: Salman Qazi <sqazi@google.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 24 Apr 2020 10:56:34 +0000 (12:56 +0200)]
block: unexport bdev_read_page and bdev_write_page
Each one just has two callers, both in always built-in code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Thu, 23 Apr 2020 03:02:38 +0000 (12:02 +0900)]
null_blk: Cleanup zoned device initialization
Move all zoned mode related code from null_blk_main.c to
null_blk_zoned.c, avoiding an ugly #ifdef in the process.
Rename null_zone_init() into null_init_zoned_dev(), null_zone_exit()
into null_free_zoned_dev() and add the new function
null_register_zoned_dev() to finalize the zoned dev setup before
add_disk().
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Thu, 23 Apr 2020 03:02:37 +0000 (12:02 +0900)]
null_blk: Fix zoned command handling
For write operations issued to a null_blk device with zoned mode
enabled, the state and write pointer position of the zone targeted by
the command should be checked before badblocks and memory backing
are handled as the write may be first failed due to, for instance, a
sector position not aligned with the zone write pointer. This order of
checking for errors reflects more accuratly the behavior of physical
zoned devices.
Furthermore, the write pointer position of the target zone should be
incremented only and only if no errors are reported by badblocks and
memory backing handling.
To fix this, introduce the small helper function null_process_cmd()
which execute null_handle_badblocks() and null_handle_memory_backed()
and use this function in null_zone_write() to correctly handle write
requests to zoned null devices depending on the type and state of the
write target zone. Also call this function in null_handle_zoned() to
process read requests to zoned null devices.
null_process_cmd() is called directly from null_handle_cmd() for
regular null devices, resulting in no functional change for these type
of devices. To have symmetric names, the function null_handle_zoned()
is renamed to null_process_zoned_cmd().
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 14 Apr 2020 07:42:25 +0000 (09:42 +0200)]
block: move dma_pad handling from blk_rq_map_sg into the callers
There are only two callers of blk_rq_map_sg/__blk_rq_map_sg that set
the dma_pad value in the queue. Move the handling into those callers
instead of burdening the common code, and move the ->extra_len field
from struct request to struct scsi_cmnd.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 14 Apr 2020 07:42:24 +0000 (09:42 +0200)]
block: move dma drain handling to scsi
Don't burden the common block code with with specifics of the libata DMA
draining mechanism. Instead move most of the code to the scsi midlayer.
That also means the nr_phys_segments adjustments in the blk-mq fast path
can go away entirely, given that SCSI never looks at nr_phys_segments
after mapping the request to a scatterlist.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 14 Apr 2020 07:42:23 +0000 (09:42 +0200)]
scsi: merge scsi_init_sgtable into scsi_init_io
scsi_init_io is the only caller of scsi_init_sgtable. Merge the two
function to make upcoming changes a little easier.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 14 Apr 2020 07:42:22 +0000 (09:42 +0200)]
block: provide a blk_rq_map_sg variant that returns the last element
To be able to move some of the special purpose hacks in blk_rq_map_sg
into the callers we need a variant that returns the last mapped
S/G list element to the caller. Add that variant as __blk_rq_map_sg
and make blk_rq_map_sg a trivial inline wrapper around it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 14 Apr 2020 07:42:21 +0000 (09:42 +0200)]
block: remove RQF_COPY_USER
The RQF_COPY_USER is set for bio where the passthrough request mapping
helpers decided that bounce buffering is required. It is then used to
pad scatterlist for drivers that required it. But given that
non-passthrough requests are per definition aligned, and directly mapped
pass-through request must be aligned it is not actually required at all.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ma, Jianpeng [Tue, 21 Apr 2020 11:48:08 +0000 (11:48 +0000)]
block: remove unused header
Dax related code already removed from this file.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Waiman Long [Tue, 21 Apr 2020 13:07:55 +0000 (09:07 -0400)]
blk-iocost: Fix error on iocost_ioc_vrate_adj
Systemtap 4.2 is unable to correctly interpret the "u32 (*missed_ppm)[2]"
argument of the iocost_ioc_vrate_adj trace entry defined in
include/trace/events/iocost.h leading to the following error:
/tmp/stapAcz0G0/stap_c89c58b83cea1724e26395efa9ed4939_6321_aux_6.c:78:8:
error: expected ‘;’, ‘,’ or ‘)’ before ‘*’ token
, u32[]* __tracepoint_arg_missed_ppm
That argument type is indeed rather complex and hard to read. Looking
at block/blk-iocost.c. It is just a 2-entry u32 array. By simplifying
the argument to a simple "u32 *missed_ppm" and adjusting the trace
entry accordingly, the compilation error was gone.
Fixes:
7caa47151ab2 ("blkcg: implement blk-iocost")
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 14 Apr 2020 07:29:02 +0000 (09:29 +0200)]
block: fold bdev_unhash_inode into invalidate_partition
invalidate_partition and bdev_unhash_inode are always paired, and
invalidate_partition already does an icache lookup for the block device
inode. Piggy back on that to remove the inode from the hash.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 14 Apr 2020 07:29:01 +0000 (09:29 +0200)]
block: mark invalidate_partition static
invalidate_partition is only used in genhd.c, so mark it static. Also
drop the return value given that is is always ignored.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 14 Apr 2020 07:29:00 +0000 (09:29 +0200)]
block: simplify block device syncing in bdev_del_partition
We just checked a little above that the block device for the partition
im busy. That implies no file system is mounted, and thus the only
thing in fsync_bdev that actually is used is sync_blockdev. Just call
sync_blockdev directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 14 Apr 2020 07:28:59 +0000 (09:28 +0200)]
block: don't call invalidate_partition from blk_drop_partitions
Given that the device must not be busy, most of the calls from
invalidate_partition that are related to file system metadata are
guranteed to not happen. Just open code the calls to sync_blockdev
and invalidate_bdev instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 14 Apr 2020 07:28:58 +0000 (09:28 +0200)]
dasd: use blk_drop_partitions instead of badly reimplementing it
Use the blk_drop_partitions function instead of messing around with
ioctls that get kernel pointers. For this blk_drop_partitions needs
to be exported, which it normally shouldn't - make an exception for
s390 only.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 14 Apr 2020 07:28:57 +0000 (09:28 +0200)]
block: remove the disk argument from blk_drop_partitions
The gendisk can be trivially deducted from the block_device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 14 Apr 2020 07:28:56 +0000 (09:28 +0200)]
block: remove hd_struct_kill
The function has a single caller, so just open code it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 14 Apr 2020 07:28:55 +0000 (09:28 +0200)]
block: cleanup hd_struct freeing
Move hd_ref_init out of line as there it isn't anywhere near a fast path,
and rename the rcu ref freeing callbacks to be more descriptive.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>