platform/kernel/linux-starfive.git
5 years agoblock: genhd: remove async_events field
Martin Wilck [Wed, 27 Mar 2019 13:51:01 +0000 (14:51 +0100)]
block: genhd: remove async_events field

The async_events field, intended to be used for drivers that support
asynchronous notifications about disk events (aka media change events),
isn't currently used by any driver, and apparently that has been that
way for a long time (if not forever). Remove it.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: only allow contiguous page structs in a bio_vec
Christoph Hellwig [Thu, 11 Apr 2019 06:23:31 +0000 (08:23 +0200)]
block: only allow contiguous page structs in a bio_vec

We currently have to call nth_page when iterating over pages inside a
bio_vec.  Jens complained a while ago that this is fairly expensive.
To mitigate this we can check that that the actual page structures
are contiguous when adding them to the bio, and just do check pointer
arithmetics later on.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: change how we get page references in bio_iov_iter_get_pages
Christoph Hellwig [Thu, 11 Apr 2019 06:23:30 +0000 (08:23 +0200)]
block: change how we get page references in bio_iov_iter_get_pages

Instead of needing a special macro to iterate over all pages in
a bvec just do a second passs over the whole bio.  This also matches
what we do on the release side.  The release side helper is moved
up to where we need the get helper to clearly express the symmetry.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: don't allow multiple bio_iov_iter_get_pages calls per bio
Christoph Hellwig [Thu, 11 Apr 2019 06:23:29 +0000 (08:23 +0200)]
block: don't allow multiple bio_iov_iter_get_pages calls per bio

No caller uses bio_iov_iter_get_pages multiple times on a given bio,
and that funtionality isn't all that useful.  Removing it will make
some future changes a little easier and also simplifies the function
a bit.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: refactor __bio_iov_bvec_add_pages
Christoph Hellwig [Thu, 11 Apr 2019 06:23:28 +0000 (08:23 +0200)]
block: refactor __bio_iov_bvec_add_pages

Return early on error, and add an unlikely annotation for that case.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: rewrite blk_bvec_map_sg to avoid a nth_page call
Christoph Hellwig [Thu, 11 Apr 2019 06:23:27 +0000 (08:23 +0200)]
block: rewrite blk_bvec_map_sg to avoid a nth_page call

The offset in scatterlists is allowed to be larger than the page size,
so don't go to great length to avoid that case and simplify the
arithmetics.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoMerge branch 'md-next' of https://github.com/liu-song-6/linux into for-5.2/block
Jens Axboe [Wed, 10 Apr 2019 22:42:02 +0000 (16:42 -0600)]
Merge branch 'md-next' of https://github.com/liu-song-6/linux into for-5.2/block

Pull MD changes from Song.

* 'md-next' of https://github.com/liu-song-6/linux:
  md: add __acquires/__releases annotations to handle_active_stripes
  md: add __acquires/__releases annotations to (un)lock_two_stripes
  md: mark md_cluster_mod static
  md: use correct type in super_1_sync
  md: use correct type in super_1_load
  md: use correct types in md_bitmap_print_sb
  md: add a missing endianness conversion in check_sb_changes
  md: add mddev->pers to avoid potential NULL pointer dereference

5 years agomd: add __acquires/__releases annotations to handle_active_stripes
Christoph Hellwig [Thu, 4 Apr 2019 16:56:16 +0000 (18:56 +0200)]
md: add __acquires/__releases annotations to handle_active_stripes

This tells sparse that we release and reacquire the device_lock and
avoids a warning.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
5 years agomd: add __acquires/__releases annotations to (un)lock_two_stripes
Christoph Hellwig [Thu, 4 Apr 2019 16:56:15 +0000 (18:56 +0200)]
md: add __acquires/__releases annotations to (un)lock_two_stripes

This tells sparse that we acquire/release the two stripe locks and
avoids a warning.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
5 years agomd: mark md_cluster_mod static
Christoph Hellwig [Thu, 4 Apr 2019 16:56:14 +0000 (18:56 +0200)]
md: mark md_cluster_mod static

Sparse complains that it has no external declaration, and it turns out
that it is never even used outside of md.c.  So just mark it static
and drop the export.

Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
5 years agomd: use correct type in super_1_sync
Christoph Hellwig [Thu, 4 Apr 2019 16:56:13 +0000 (18:56 +0200)]
md: use correct type in super_1_sync

If we want to convert from a little endian format we need to cast
to a little endian type, otherwise sparse will be unhappy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
5 years agomd: use correct type in super_1_load
Christoph Hellwig [Thu, 4 Apr 2019 16:56:12 +0000 (18:56 +0200)]
md: use correct type in super_1_load

If we want to convert from a little endian format we need to cast
to a little endian type, otherwise sparse will be unhappy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
5 years agomd: use correct types in md_bitmap_print_sb
Christoph Hellwig [Thu, 4 Apr 2019 16:56:11 +0000 (18:56 +0200)]
md: use correct types in md_bitmap_print_sb

If we want to convert from a little endian format we need to cast
to a little endian type, otherwise sparse will be unhappy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
5 years agomd: add a missing endianness conversion in check_sb_changes
Christoph Hellwig [Thu, 4 Apr 2019 16:56:10 +0000 (18:56 +0200)]
md: add a missing endianness conversion in check_sb_changes

The on-disk value is little endian and we need to convert it to
native endian before storing the value in the in-core structure.

Fixes: 7564beda19b36 ("md-cluster/raid10: support add disk under grow mode")
Cc: <stable@vger.kernel.org> # 4.20+
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
5 years agomd: add mddev->pers to avoid potential NULL pointer dereference
Yufen Yu [Tue, 2 Apr 2019 06:22:14 +0000 (14:22 +0800)]
md: add mddev->pers to avoid potential NULL pointer dereference

When doing re-add, we need to ensure rdev->mddev->pers is not NULL,
which can avoid potential NULL pointer derefence in fallowing
add_bound_rdev().

Fixes: a6da4ef85cef ("md: re-add a failed disk")
Cc: Xiao Ni <xni@redhat.com>
Cc: NeilBrown <neilb@suse.com>
Cc: <stable@vger.kernel.org> # 4.4+
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
5 years agoblock: null: Add documentation for "zone_nr_conv" param
Minwoo Im [Sun, 7 Apr 2019 08:19:38 +0000 (17:19 +0900)]
block: null: Add documentation for "zone_nr_conv" param

zone_nr_conv module parameter was introduced by a commit
ea2c18e1("null_blk: Add conventional zone configuration for zoned
support").

This patch describes "zone_nr_conv" module parameter to the document.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: fix build warning in merging bvecs
Ming Lei [Tue, 2 Apr 2019 02:26:44 +0000 (10:26 +0800)]
block: fix build warning in merging bvecs

Commit f6970f83ef79 ("block: don't check if adjacent bvecs in one bio can
be mergeable") changes bvec merge by only considering two bvecs from
different bios. However, if the former bio doesn't inlcude any io bvec,
then the following warning may be triggered:

 warning: ‘bvec.bv_offset’ may be used uninitialized in this function [-Wmaybe-uninitialized]

In practice, it shouldn't be triggered.

Fixes it by adding check on former bio, the check shouldn't add any cost
given 'bio->bi_iter' can be hit in cache.

Reported-by: Jens Axboe <axboe@kernel.dk>
Fixes: f6970f83ef79 ("block: don't check if adjacent bvecs in one bio can be mergeable")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock, bfq: fix some typos in comments
Angelo Ruocco [Mon, 8 Apr 2019 15:35:34 +0000 (17:35 +0200)]
block, bfq: fix some typos in comments

Some of the comments in the bfq files had typos. This patch fixes them.

Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: remove unused variable 'def'
Hisao Tanabe [Sun, 7 Apr 2019 15:27:42 +0000 (00:27 +0900)]
block: remove unused variable 'def'

The 'def' local variable became unused after commit f382fb0bcef4 ("block: remove
legacy IO schedulers"), let's remove it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hisao Tanabe <xtanabe@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agovirtio_blk: replace 0 by HCTX_TYPE_DEFAULT to index blk_mq_tag_set->map
Dongli Zhang [Tue, 12 Mar 2019 01:31:56 +0000 (09:31 +0800)]
virtio_blk: replace 0 by HCTX_TYPE_DEFAULT to index blk_mq_tag_set->map

Use HCTX_TYPE_DEFAULT instead of 0 to avoid hardcoding.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: sed-opal: rename next to execute_steps
David Kozub [Thu, 14 Feb 2019 00:16:08 +0000 (01:16 +0100)]
block: sed-opal: rename next to execute_steps

As the function is responsible for executing the individual steps supplied
in the steps argument, execute_steps is a more descriptive name than the
rather generic next.

Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: sed-opal: don't repeat opal_discovery0 in each steps array
David Kozub [Thu, 14 Feb 2019 00:16:07 +0000 (01:16 +0100)]
block: sed-opal: don't repeat opal_discovery0 in each steps array

Originally each of the opal functions that call next include
opal_discovery0 in the array of steps. This is superfluous and
can be done always inside next.

Acked-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: sed-opal: pass steps via argument rather than via opal_dev
David Kozub [Thu, 14 Feb 2019 00:16:06 +0000 (01:16 +0100)]
block: sed-opal: pass steps via argument rather than via opal_dev

The steps argument is only read by the next function, so it can
be passed directly as an argument rather than via opal_dev.

Normally, the steps is an array on the stack, so the pointer stops
being valid then the function that set opal_dev.steps returns.
If opal_dev.steps was not set to NULL before return it would become
a dangling pointer. When the steps are passed as argument this
becomes easier to see and more difficult to misuse.

Acked-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: sed-opal: use named Opal tokens instead of integer literals
David Kozub [Thu, 14 Feb 2019 00:16:05 +0000 (01:16 +0100)]
block: sed-opal: use named Opal tokens instead of integer literals

Replace integer literals by Opal tokens defined in opal_proto.h where
possible.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: sed-opal: unify retrieval of table columns
David Kozub [Thu, 14 Feb 2019 00:16:04 +0000 (01:16 +0100)]
block: sed-opal: unify retrieval of table columns

Instead of having multiple places defining the same argument list to get
a specific column of a sed-opal table, provide a generic version and
call it from those functions.

Co-authored-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: sed-opal: add token for OPAL_LIFECYCLE
David Kozub [Thu, 14 Feb 2019 00:16:03 +0000 (01:16 +0100)]
block: sed-opal: add token for OPAL_LIFECYCLE

Define OPAL_LIFECYCLE token and use it instead of literals in
get_lsp_lifecycle.

Acked-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: sed-opal: split generation of bytestring header and content
Jonas Rabenstein [Thu, 14 Feb 2019 00:16:02 +0000 (01:16 +0100)]
block: sed-opal: split generation of bytestring header and content

Split the header generation from the (normal) memcpy part if a
bytestring is copied into the command buffer. This allows in-place
generation of the bytestring content. For example, copy_from_user may be
used without an intermediate buffer.

Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: sed-opal: print failed function address
Jonas Rabenstein [Thu, 14 Feb 2019 00:16:01 +0000 (01:16 +0100)]
block: sed-opal: print failed function address

Add function address (and if available its symbol) to the message if a
step function fails.

Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: sed-opal: reuse response_get_token to decrease code duplication
David Kozub [Thu, 14 Feb 2019 00:16:00 +0000 (01:16 +0100)]
block: sed-opal: reuse response_get_token to decrease code duplication

response_get_token had already been in place, its functionality had
been duplicated within response_get_{u64,bytestring} with the same error
handling. Unify the handling by reusing response_get_token within the
other functions.

Co-authored-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: sed-opal: unify error handling of responses
David Kozub [Thu, 14 Feb 2019 00:15:59 +0000 (01:15 +0100)]
block: sed-opal: unify error handling of responses

response_get_{string,u64} include error handling for argument resp being
NULL but response_get_token does not handle this.

Make all three of response_get_{string,u64,token} handle NULL resp in
the same way.

Co-authored-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: sed-opal: unify cmd start
David Kozub [Thu, 14 Feb 2019 00:15:58 +0000 (01:15 +0100)]
block: sed-opal: unify cmd start

Every step starts with resetting the cmd buffer as well as the comid and
constructs the appropriate OPAL_CALL command. Consequently, those
actions may be combined into one generic function. On should take care
that the opening and closing tokens for the argument list are already
emitted by cmd_start and cmd_finalize respectively and thus must not be
additionally added.

Co-authored-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: sed-opal: close parameter list in cmd_finalize
David Kozub [Thu, 14 Feb 2019 00:15:57 +0000 (01:15 +0100)]
block: sed-opal: close parameter list in cmd_finalize

Every step ends by calling cmd_finalize (via finalize_and_send)
yet every step adds the token OPAL_ENDLIST on its own. Moving
this into cmd_finalize decreases code duplication.

Co-authored-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: sed-opal: unify space check in add_token_*
Jonas Rabenstein [Thu, 14 Feb 2019 00:15:56 +0000 (01:15 +0100)]
block: sed-opal: unify space check in add_token_*

All add_token_* functions have a common set of conditions that have to
be checked. Use a common function for those checks in order to avoid
different behaviour as well as code duplication.

Acked-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Co-authored-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: sed-opal: use correct macro for method length
Jonas Rabenstein [Thu, 14 Feb 2019 00:15:55 +0000 (01:15 +0100)]
block: sed-opal: use correct macro for method length

Also the values of OPAL_UID_LENGTH and OPAL_METHOD_LENGTH are the same,
it is weird to use OPAL_UID_LENGTH for the definition of the methods.

Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: sed-opal: fix typos and formatting
David Kozub [Thu, 14 Feb 2019 00:15:54 +0000 (01:15 +0100)]
block: sed-opal: fix typos and formatting

This should make no change in functionality.
The formatting changes were triggered by checkpatch.pl.

Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: sed-opal: fix IOC_OPAL_ENABLE_DISABLE_MBR
David Kozub [Thu, 14 Feb 2019 00:15:53 +0000 (01:15 +0100)]
block: sed-opal: fix IOC_OPAL_ENABLE_DISABLE_MBR

The implementation of IOC_OPAL_ENABLE_DISABLE_MBR handled the value
opal_mbr_data.enable_disable incorrectly: enable_disable is expected
to be one of OPAL_MBR_ENABLE(0) or OPAL_MBR_DISABLE(1). enable_disable
was passed directly to set_mbr_done and set_mbr_enable_disable where
is was interpreted as either OPAL_TRUE(1) or OPAL_FALSE(0). The end
result was that calling IOC_OPAL_ENABLE_DISABLE_MBR with OPAL_MBR_ENABLE
actually disabled the shadow MBR and vice versa.

This patch adds correct conversion from OPAL_MBR_DISABLE/ENABLE to
OPAL_FALSE/TRUE. The change affects existing programs using
IOC_OPAL_ENABLE_DISABLE_MBR but this is typically used only once when
setting up an Opal drive.

Acked-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: remove CONFIG_LBDAF
Christoph Hellwig [Fri, 5 Apr 2019 16:08:59 +0000 (18:08 +0200)]
block: remove CONFIG_LBDAF

Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit
architectures.  These types are required to support block device and/or
file sizes larger than 2 TiB, and have generally defaulted to on for
a long time.  Enabling the option only increases the i386 tinyconfig
size by 145 bytes, and many data structures already always use
64-bit values for their in-core and on-disk data structures anyway,
so there should not be a large change in dynamic memory usage either.

Dropping this option removes a somewhat weird non-default config that
has cause various bugs or compiler warnings when actually used.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoMerge branch 'nvme-5.2' of git://git.infradead.org/nvme into for-5.2/block
Jens Axboe [Fri, 5 Apr 2019 15:08:51 +0000 (09:08 -0600)]
Merge branch 'nvme-5.2' of git://git.infradead.org/nvme into for-5.2/block

Pull NVMe changes from Christoph:

"Below is the first batch of nvme updates for 5.2. This includes the
performance improvements for single segment I/O on PCIe, which introduce
new block helpers, so it might be a good idea to get them in early.

 - various performance optimizations in the PCIe code (Keith and me)
 - new block helpers to support the above (me)
 - nvmet error conversion cleanup (me)
 - nvmet-fc variable sized array cleanup (Gustavo)
 - passthrough ioctl error printk cleanup (Kenneth)
 - small nvmet fixes (Max)
 - endianess conversion cleanup (Max)
 - nvmet-tcp faspath completion optimization (Sagi)"

* 'nvme-5.2' of git://git.infradead.org/nvme: (24 commits)
  nvme: log the error status on Identify Namespace failure
  nvmet: add safety check for subsystem lock during nvmet_ns_changed
  nvmet: never fail double namespace enablement
  nvme-pci: tidy up nvme_map_data
  nvme-pci: optimize mapping single segment requests using SGLs
  nvme-pci: optimize mapping of small single segment requests
  nvme-pci: remove the inline scatterlist optimization
  nvme-pci: split metadata handling from nvme_map_data / nvme_unmap_data
  nvme-pci: do not build a scatterlist to map metadata
  nvme-pci: only call nvme_unmap_data for requests transferring data
  nvme-pci: merge nvme_free_iod into nvme_unmap_data
  nvme-pci: move the call to nvme_cleanup_cmd out of nvme_unmap_data
  nvme-pci: remove nvme_init_iod
  block: add dma_map_bvec helper
  block: add a rq_dma_dir helper
  block: add a rq_integrity_vec helper
  block: add a req_bvec helper
  nvme-pci: remove unused nvme_iod member
  nvme-pci: remove q_dmadev from nvme_queue
  nvme-pci: use a flag for polled queues
  ...

5 years agonvme: log the error status on Identify Namespace failure
Kenneth Heitke [Thu, 4 Apr 2019 18:57:45 +0000 (12:57 -0600)]
nvme: log the error status on Identify Namespace failure

Identify Namespace failures are logged as a warning but there is not
an indication of the cause for the failure. Update the log message to
include the error status.

Signed-off-by: Kenneth Heitke <kenneth.heitke@intel.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
5 years agonvmet: add safety check for subsystem lock during nvmet_ns_changed
Max Gurtovoy [Tue, 2 Apr 2019 11:51:54 +0000 (14:51 +0300)]
nvmet: add safety check for subsystem lock during nvmet_ns_changed

we need to make sure that subsystem lock is taken during ctrl's list
traversing. nvmet_ns_changed function is not static and can be used from
various callers simultaneously.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
5 years agonvmet: never fail double namespace enablement
Max Gurtovoy [Tue, 2 Apr 2019 11:52:47 +0000 (14:52 +0300)]
nvmet: never fail double namespace enablement

In case we create N namespaces while N < NVMET_MAX_NAMESPACES, we can
perform "echo 1 > <nsid>/enable" as much as we want. In case N ==
NVMET_MAX_NAMESPACES we fail. Make sure we have the same flow for any N.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
5 years agonvme-pci: tidy up nvme_map_data
Christoph Hellwig [Tue, 5 Mar 2019 12:59:02 +0000 (05:59 -0700)]
nvme-pci: tidy up nvme_map_data

Remove two pointless local variables, remove ret assignment that is
never used, move the use_sgl initialization closer to where it is used.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
5 years agonvme-pci: optimize mapping single segment requests using SGLs
Christoph Hellwig [Tue, 5 Mar 2019 12:54:18 +0000 (05:54 -0700)]
nvme-pci: optimize mapping single segment requests using SGLs

If the controller supports SGLs we can take another short cut for single
segment request, given that we can always map those without another
indirection structure, and thus don't need to create a scatterlist
structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
5 years agonvme-pci: optimize mapping of small single segment requests
Christoph Hellwig [Tue, 5 Mar 2019 12:49:34 +0000 (05:49 -0700)]
nvme-pci: optimize mapping of small single segment requests

If a request is single segment and fits into one or two PRP entries we
do not have to create a scatterlist for it, but can just map the bio_vec
directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
5 years agonvme-pci: remove the inline scatterlist optimization
Christoph Hellwig [Tue, 5 Mar 2019 12:46:58 +0000 (05:46 -0700)]
nvme-pci: remove the inline scatterlist optimization

We'll have a better way to optimize for small I/O that doesn't
require it soon, so remove the existing inline_sg case to make that
optimization easier to implement.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
5 years agonvme-pci: split metadata handling from nvme_map_data / nvme_unmap_data
Christoph Hellwig [Sun, 3 Mar 2019 16:46:28 +0000 (09:46 -0700)]
nvme-pci: split metadata handling from nvme_map_data / nvme_unmap_data

This prepares for some bigger changes to the data mapping helpers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
5 years agonvme-pci: do not build a scatterlist to map metadata
Christoph Hellwig [Sun, 3 Mar 2019 15:19:18 +0000 (08:19 -0700)]
nvme-pci: do not build a scatterlist to map metadata

We always have exactly one segment, so we can simply call dma_map_bvec.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
5 years agonvme-pci: only call nvme_unmap_data for requests transferring data
Christoph Hellwig [Sun, 3 Mar 2019 15:52:21 +0000 (08:52 -0700)]
nvme-pci: only call nvme_unmap_data for requests transferring data

This mirrors how nvme_map_pci is called and will allow simplifying some
checks in nvme_unmap_pci later on.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
5 years agonvme-pci: merge nvme_free_iod into nvme_unmap_data
Christoph Hellwig [Sun, 3 Mar 2019 15:15:19 +0000 (08:15 -0700)]
nvme-pci: merge nvme_free_iod into nvme_unmap_data

This means we now have a function that undoes everything nvme_map_data
does and we can simplify the error handling a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
5 years agonvme-pci: move the call to nvme_cleanup_cmd out of nvme_unmap_data
Christoph Hellwig [Sun, 3 Mar 2019 15:13:03 +0000 (08:13 -0700)]
nvme-pci: move the call to nvme_cleanup_cmd out of nvme_unmap_data

Cleaning up the command setup isn't related to unmapping data, and
disentangling them will simplify error handling a bit down the road.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
5 years agonvme-pci: remove nvme_init_iod
Christoph Hellwig [Sun, 3 Mar 2019 15:04:01 +0000 (08:04 -0700)]
nvme-pci: remove nvme_init_iod

nvme_init_iod should really be split into two parts: initialize a few
general iod fields, which can easily be done at the beginning of
nvme_queue_rq, and allocating the scatterlist if needed, which logically
belongs into nvme_map_data with the code making use of it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
5 years agoblock: add dma_map_bvec helper
Christoph Hellwig [Sun, 3 Mar 2019 15:40:36 +0000 (08:40 -0700)]
block: add dma_map_bvec helper

Provide a nice little shortcut for mapping a single bvec.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
5 years agoblock: add a rq_dma_dir helper
Christoph Hellwig [Sun, 3 Mar 2019 15:18:30 +0000 (08:18 -0700)]
block: add a rq_dma_dir helper

In a lot of places we want to know the DMA direction for a given
struct request.  Add a little helper to make it a littler easier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
5 years agoblock: add a rq_integrity_vec helper
Christoph Hellwig [Sun, 3 Mar 2019 15:38:29 +0000 (08:38 -0700)]
block: add a rq_integrity_vec helper

This provides a nice little shortcut to get the integrity data for
drivers like NVMe that only support a single integrity segment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
5 years agoblock: add a req_bvec helper
Christoph Hellwig [Sun, 3 Mar 2019 16:14:01 +0000 (09:14 -0700)]
block: add a req_bvec helper

Return the currently active bvec segment, potentially spanning multiple
pages.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
5 years agonvme-pci: remove unused nvme_iod member
Keith Busch [Fri, 8 Mar 2019 17:43:13 +0000 (10:43 -0700)]
nvme-pci: remove unused nvme_iod member

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
5 years agonvme-pci: remove q_dmadev from nvme_queue
Keith Busch [Fri, 8 Mar 2019 17:43:11 +0000 (10:43 -0700)]
nvme-pci: remove q_dmadev from nvme_queue

We don't need to save the dma device as it's not used in the hot path
and hasn't in a long time. Shrink the struct nvme_queue removing this
unnecessary member.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
5 years agonvme-pci: use a flag for polled queues
Keith Busch [Fri, 8 Mar 2019 17:43:06 +0000 (10:43 -0700)]
nvme-pci: use a flag for polled queues

A negative value for the cq_vector used to mean the queue is either
disabled or a polled queue. However, we have a queue enabled flag,
so the cq_vector had been serving double duty.

Don't overload the meaning of cq_vector. Use a flag specific to the
polled queues instead.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
5 years agonvmet-tcp: implement C2HData SUCCESS optimization
Sagi Grimberg [Fri, 8 Mar 2019 23:41:21 +0000 (15:41 -0800)]
nvmet-tcp: implement C2HData SUCCESS optimization

TP 8000 says that the use of the SUCCESS flag depends on weather the
controller support disabling sq_head pointer updates. Given that we
support it by default, makes sense that we go the extra mile to actually
use the SUCCESS flag.

When we create the C2HData PDU header, we check if sqhd_disabled is set
on our queue, if so, we set the SUCCESS flag in the PDU header and
skip sending a completion response capsule.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Oliver Smith-Denny <osmithde@cisco.com>
Tested-by: Oliver Smith-Denny <osmithde@cisco.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
5 years agonvmet-fc: use zero-sized array and struct_size() in kzalloc()
Gustavo A. R. Silva [Sat, 23 Feb 2019 18:51:08 +0000 (12:51 -0600)]
nvmet-fc: use zero-sized array and struct_size() in kzalloc()

Update the code to use a zero-sized array instead of a pointer in
structure nvmet_fc_tgt_queue and use struct_size() in kzalloc().

Notice that one of the more common cases of allocation size calculations
is finding the size of a structure that has a zero-sized array at the end,
along with memory for some number of elements for that array. For example:

struct foo {
int stuff;
struct boo entry[];
};

instance = kzalloc(sizeof(struct foo) + sizeof(struct boo) * count, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can now
use the new struct_size() helper:

instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
5 years agonvmet: avoid double errno conversions
Christoph Hellwig [Tue, 12 Mar 2019 17:05:10 +0000 (18:05 +0100)]
nvmet: avoid double errno conversions

Use errno_to_nvme_status to convert from a negative errno to a
nvme status field instead of going through a blk_status_t.

Also remove the pointless status variable in
nvmet_bdev_execute_write_zeroes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
5 years agonvme: avoid double dereference to convert le to cpu
Max Gurtovoy [Mon, 25 Feb 2019 11:00:04 +0000 (13:00 +0200)]
nvme: avoid double dereference to convert le to cpu

Use le16_to_cpu instead of le16_to_cpup and le64_to_cpu instead of
le64_to_cpup. This will also align the code to nvme-core driver
convention.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
5 years agoblock: bio: ensure newly added bio flags don't override BVEC_POOL_IDX
Johannes Thumshirn [Wed, 3 Apr 2019 09:15:19 +0000 (11:15 +0200)]
block: bio: ensure newly added bio flags don't override BVEC_POOL_IDX

With the introduction of BIO_NO_PAGE_REF we've used up all available bits
in bio::bi_flags.

Convert the defines of the flags to an enum and add a BUILD_BUG_ON() call
to make sure no-one adds a new one and thus overrides the BVEC_POOL_IDX
causing crashes.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agomd: batch flush requests.
NeilBrown [Fri, 29 Mar 2019 17:46:17 +0000 (10:46 -0700)]
md: batch flush requests.

Currently if many flush requests are submitted to an md device is quick
succession, they are serialized and can take a long to process them all.
We don't really need to call flush all those times - a single flush call
can satisfy all requests submitted before it started.
So keep track of when the current flush started and when it finished,
allow any pending flush that was requested before the flush started
to complete without waiting any more.

Test results from Xiao:

Test is done on a raid10 device which is created by 4 SSDs. The tool is
dbench.

1. The latest linux stable kernel
  Operation                Count    AvgLat    MaxLat
  --------------------------------------------------
  Deltree                    768    10.509    78.305
  Flush                  2078376     0.013    10.094
  Close                  21787697     0.019    18.821
  LockX                    96580     0.007     3.184
  Mkdir                      384     0.008     0.062
  Rename                 1255883     0.191    23.534
  ReadX                  46495589     0.020    14.230
  WriteX                 14790591     7.123    60.706
  Unlink                 5989118     0.440    54.551
  UnlockX                  96580     0.005     2.736
  FIND_FIRST             10393845     0.042    12.079
  SET_FILE_INFORMATION   2415558     0.129    10.088
  QUERY_FILE_INFORMATION 4711725     0.005     8.462
  QUERY_PATH_INFORMATION 26883327     0.032    21.715
  QUERY_FS_INFORMATION   4929409     0.010     8.238
  NTCreateX              29660080     0.100    53.268

Throughput 1034.88 MB/sec (sync open)  128 clients  128 procs
max_latency=60.712 ms

2. With patch1 "Revert "MD: fix lock contention for flush bios""
  Operation                Count    AvgLat    MaxLat
  --------------------------------------------------
  Deltree                    256     8.326    36.761
  Flush                   693291     3.974   180.269
  Close                  7266404     0.009    36.929
  LockX                    32160     0.006     0.840
  Mkdir                      128     0.008     0.021
  Rename                  418755     0.063    29.945
  ReadX                  15498708     0.007     7.216
  WriteX                 4932310    22.482   267.928
  Unlink                 1997557     0.109    47.553
  UnlockX                  32160     0.004     1.110
  FIND_FIRST             3465791     0.036     7.320
  SET_FILE_INFORMATION    805825     0.015     1.561
  QUERY_FILE_INFORMATION 1570950     0.005     2.403
  QUERY_PATH_INFORMATION 8965483     0.013    14.277
  QUERY_FS_INFORMATION   1643626     0.009     3.314
  NTCreateX              9892174     0.061    41.278

Throughput 345.009 MB/sec (sync open)  128 clients  128 procs
max_latency=267.939 m

3. With patch1 and patch2
  Operation                Count    AvgLat    MaxLat
  --------------------------------------------------
  Deltree                    768     9.570    54.588
  Flush                  2061354     0.666    15.102
  Close                  21604811     0.012    25.697
  LockX                    95770     0.007     1.424
  Mkdir                      384     0.008     0.053
  Rename                 1245411     0.096    12.263
  ReadX                  46103198     0.011    12.116
  WriteX                 14667988     7.375    60.069
  Unlink                 5938936     0.173    30.905
  UnlockX                  95770     0.005     4.147
  FIND_FIRST             10306407     0.041    11.715
  SET_FILE_INFORMATION   2395987     0.048     7.640
  QUERY_FILE_INFORMATION 4672371     0.005     9.291
  QUERY_PATH_INFORMATION 26656735     0.018    19.719
  QUERY_FS_INFORMATION   4887940     0.010     7.654
  NTCreateX              29410811     0.059    28.551

Throughput 1026.21 MB/sec (sync open)  128 clients  128 procs
max_latency=60.075 ms

Cc: <stable@vger.kernel.org> # v4.19+
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoRevert "MD: fix lock contention for flush bios"
NeilBrown [Fri, 29 Mar 2019 17:46:16 +0000 (10:46 -0700)]
Revert "MD: fix lock contention for flush bios"

This reverts commit 5a409b4f56d50b212334f338cb8465d65550cd85.

This patch has two problems.

1/ it make multiple calls to submit_bio() from inside a make_request_fn.
 The bios thus submitted will be queued on current->bio_list and not
 submitted immediately.  As the bios are allocated from a mempool,
 this can theoretically result in a deadlock - all the pool of requests
 could be in various ->bio_list queues and a subsequent mempool_alloc
 could block waiting for one of them to be released.

2/ It aims to handle a case when there are many concurrent flush requests.
  It handles this by submitting many requests in parallel - all of which
  are identical and so most of which do nothing useful.
  It would be more efficient to just send one lower-level request, but
  allow that to satisfy multiple upper-level requests.

Fixes: 5a409b4f56d5 ("MD: fix lock contention for flush bios")
Cc: <stable@vger.kernel.org> # v4.19+
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoDon't jump to compute_result state from check_result state
Nigel Croxon [Fri, 29 Mar 2019 17:46:15 +0000 (10:46 -0700)]
Don't jump to compute_result state from check_result state

Changing state from check_state_check_result to
check_state_compute_result not only is unsafe but also doesn't
appear to serve a valid purpose.  A raid6 check should only be
pushing out extra writes if doing repair and a mis-match occurs.
The stripe dev management will already try and do repair writes
for failing sectors.

This patch makes the raid6 check_state_check_result handling
work more like raid5's.  If somehow too many failures for a
check, just quit the check operation for the stripe.  When any
checks pass, don't try and use check_state_compute_result for
a purpose it isn't needed for and is unsafe for.  Just mark the
stripe as in sync for passing its parity checks and let the
stripe dev read/write code and the bad blocks list do their
job handling I/O errors.

Repro steps from Xiao:

These are the steps to reproduce this problem:
1. redefined OPT_MEDIUM_ERR_ADDR to 12000 in scsi_debug.c
2. insmod scsi_debug.ko dev_size_mb=11000  max_luns=1 num_tgts=1
3. mdadm --create /dev/md127 --level=6 --raid-devices=5 /dev/sde1 /dev/sde2 /dev/sde3 /dev/sde5 /dev/sde6
sde is the disk created by scsi_debug
4. echo "2" >/sys/module/scsi_debug/parameters/opts
5. raid-check

It panic:
[ 4854.730899] md: data-check of RAID array md127
[ 4854.857455] sd 5:0:0:0: [sdr] tag#80 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4854.859246] sd 5:0:0:0: [sdr] tag#80 Sense Key : Medium Error [current]
[ 4854.860694] sd 5:0:0:0: [sdr] tag#80 Add. Sense: Unrecovered read error
[ 4854.862207] sd 5:0:0:0: [sdr] tag#80 CDB: Read(10) 28 00 00 00 2d 88 00 04 00 00
[ 4854.864196] print_req_error: critical medium error, dev sdr, sector 11656 flags 0
[ 4854.867409] sd 5:0:0:0: [sdr] tag#100 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4854.869469] sd 5:0:0:0: [sdr] tag#100 Sense Key : Medium Error [current]
[ 4854.871206] sd 5:0:0:0: [sdr] tag#100 Add. Sense: Unrecovered read error
[ 4854.872858] sd 5:0:0:0: [sdr] tag#100 CDB: Read(10) 28 00 00 00 2e e0 00 00 08 00
[ 4854.874587] print_req_error: critical medium error, dev sdr, sector 12000 flags 4000
[ 4854.876456] sd 5:0:0:0: [sdr] tag#101 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4854.878552] sd 5:0:0:0: [sdr] tag#101 Sense Key : Medium Error [current]
[ 4854.880278] sd 5:0:0:0: [sdr] tag#101 Add. Sense: Unrecovered read error
[ 4854.881846] sd 5:0:0:0: [sdr] tag#101 CDB: Read(10) 28 00 00 00 2e e8 00 00 08 00
[ 4854.883691] print_req_error: critical medium error, dev sdr, sector 12008 flags 4000
[ 4854.893927] sd 5:0:0:0: [sdr] tag#166 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4854.896002] sd 5:0:0:0: [sdr] tag#166 Sense Key : Medium Error [current]
[ 4854.897561] sd 5:0:0:0: [sdr] tag#166 Add. Sense: Unrecovered read error
[ 4854.899110] sd 5:0:0:0: [sdr] tag#166 CDB: Read(10) 28 00 00 00 2e e0 00 00 10 00
[ 4854.900989] print_req_error: critical medium error, dev sdr, sector 12000 flags 0
[ 4854.902757] md/raid:md127: read error NOT corrected!! (sector 9952 on sdr1).
[ 4854.904375] md/raid:md127: read error NOT corrected!! (sector 9960 on sdr1).
[ 4854.906201] ------------[ cut here ]------------
[ 4854.907341] kernel BUG at drivers/md/raid5.c:4190!

raid5.c:4190 above is this BUG_ON:

    handle_parity_checks6()
        ...
        BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */

Cc: <stable@vger.kernel.org> # v3.16+
OriginalAuthor: David Jeffery <djeffery@redhat.com>
Cc: Xiao Ni <xni@redhat.com>
Tested-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: David Jeffy <djeffery@redhat.com>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: loop: mark bvec as ITER_BVEC_FLAG_NO_REF
Ming Lei [Thu, 28 Mar 2019 03:05:31 +0000 (11:05 +0800)]
block: loop: mark bvec as ITER_BVEC_FLAG_NO_REF

loop is one block device, for any bio submitted to this device,
the upper layer does guarantee that pages added to loop's bio won't
go away when the bio is in-flight.

So mark loop's bvec as ITER_BVEC_FLAG_NO_REF then get_page/put_page
can be saved for serving loop's IO.

Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: don't check if adjacent bvecs in one bio can be mergeable
Ming Lei [Sun, 17 Mar 2019 10:01:12 +0000 (18:01 +0800)]
block: don't check if adjacent bvecs in one bio can be mergeable

Now both passthrough and FS IO have supported multi-page bvec, and
bvec merging has been handled actually when adding page to bio, then
adjacent bvecs won't be mergeable any more if they belong to same bio.

So only try to merge bvecs if they are from different bios.

Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: reuse __blk_bvec_map_sg() for mapping page sized bvec
Ming Lei [Sun, 17 Mar 2019 10:01:11 +0000 (18:01 +0800)]
block: reuse __blk_bvec_map_sg() for mapping page sized bvec

Inside __blk_segment_map_sg(), page sized bvec mapping is optimized
a bit with one standalone branch.

So reuse __blk_bvec_map_sg() to do that.

Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: remove argument of 'request_queue' from __blk_bvec_map_sg
Ming Lei [Sun, 17 Mar 2019 10:01:10 +0000 (18:01 +0800)]
block: remove argument of 'request_queue' from __blk_bvec_map_sg

The argument of 'request_queue' isn't used by __blk_bvec_map_sg(),
so remove it.

Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: enable multi-page bvec for passthrough IO
Ming Lei [Fri, 29 Mar 2019 07:08:00 +0000 (15:08 +0800)]
block: enable multi-page bvec for passthrough IO

Now block IO stack is basically ready for supporting multi-page bvec,
however it isn't enabled on passthrough IO.

One reason is that passthrough IO is dispatched to LLD directly and bio
split is bypassed, so the bio has to be built correctly for dispatch to
LLD from the beginning.

Implement multi-page support for passthrough IO by limitting each bvec
as block device's segment and applying all kinds of queue limit in
blk_add_pc_page(). Then we don't need to calculate segments any more for
passthrough IO any more, turns out code is simplified much.

Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: put the same page when adding it to bio
Ming Lei [Sun, 17 Mar 2019 10:01:08 +0000 (18:01 +0800)]
block: put the same page when adding it to bio

When the added page is merged to last same page in bio_add_pc_page(),
the user may need to put this page for avoiding page leak.

bio_map_user_iov() needs this kind of handling, and now it deals with
it by itself in hack style.

Moves the handling of put page into __bio_add_pc_page(), so
bio_map_user_iov() may be simplified a bit, and maybe more users
can benefit from this change.

Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: check if page is mergeable in one helper
Ming Lei [Sun, 17 Mar 2019 10:01:07 +0000 (18:01 +0800)]
block: check if page is mergeable in one helper

Now the check for deciding if one page is mergeable to current bvec
becomes a bit complicated, and we need to reuse the code before
adding pc page.

So move the check in one dedicated helper.

No function change.

Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: cleanup bio_add_pc_page
Ming Lei [Sun, 17 Mar 2019 10:01:06 +0000 (18:01 +0800)]
block: cleanup bio_add_pc_page

REQ_PC is out of date, so replace it with passthrough IO.

Also remove the local variable of 'prev' since we can reuse
the top local variable of 'bvec'.

No function change.

Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: don't merge adjacent bvecs to one segment in bio blk_queue_split
Ming Lei [Sun, 17 Mar 2019 10:01:05 +0000 (18:01 +0800)]
block: don't merge adjacent bvecs to one segment in bio blk_queue_split

For normal filesystem IO, each page is added via blk_add_page(),
in which bvec(page) merge has been handled already, and basically
not possible to merge two adjacent bvecs in one bio.

So not try to merge two adjacent bvecs in blk_queue_split().

Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: avoid to break XEN by multi-page bvec
Ming Lei [Sun, 17 Mar 2019 10:01:04 +0000 (18:01 +0800)]
block: avoid to break XEN by multi-page bvec

XEN has special page merge requirement, see xen_biovec_phys_mergeable().
We can't merge pages into one bvec simply for XEN.

So move XEN's specific check on page merge into __bio_try_merge_page(),
then abvoid to break XEN by multi-page bvec.

Cc: ris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: xen-devel@lists.xenproject.org
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: pass page to xen_biovec_phys_mergeable
Ming Lei [Fri, 29 Mar 2019 07:07:54 +0000 (15:07 +0800)]
block: pass page to xen_biovec_phys_mergeable

xen_biovec_phys_mergeable() only needs .bv_page of the 2nd bio bvec
for checking if the two bvecs can be merged, so pass page to
xen_biovec_phys_mergeable() directly.

No function change.

Cc: ris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: xen-devel@lists.xenproject.org
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoloop: properly observe rotational flag of underlying device
Holger Hoffstätte [Tue, 12 Feb 2019 22:54:24 +0000 (15:54 -0700)]
loop: properly observe rotational flag of underlying device

The loop driver always declares the rotational flag of its device as
rotational, even when the device of the mapped file is nonrotational,
as is the case with SSDs or on tmpfs. This can confuse filesystem tools
which are SSD-aware; in my case I frequently forget to tell mkfs.btrfs
that my loop device on tmpfs is nonrotational, and that I really don't
need any automatic metadata redundancy.

The attached patch fixes this by introspecting the rotational flag of the
mapped file's underlying block device, if it exists. If the mapped file's
filesystem has no associated block device - as is the case on e.g. tmpfs -
we assume nonrotational storage. If there is a better way to identify such
non-devices I'd love to hear them.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: holger@applied-asynchrony.com
Signed-off-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Signed-off-by: Gwendal Grignou <gwendal@chromium.org>
Signed-off-by: Benjamin Gordon <bmgordon@chromium.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agodoc, block, bfq: add information on bfq execution time
Paolo Valente [Tue, 12 Mar 2019 08:59:35 +0000 (09:59 +0100)]
doc, block, bfq: add information on bfq execution time

The execution time of BFQ has been slightly lowered. Report the new
execution time in BFQ documentation.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock, bfq: save & resume weight on a queue merge/split
Francesco Pollicino [Tue, 12 Mar 2019 08:59:34 +0000 (09:59 +0100)]
block, bfq: save & resume weight on a queue merge/split

bfq saves the state of a queue each time a merge occurs, to be
able to resume such a state when the queue is associated again
with its original process, on a split.

Unfortunately bfq does not save & restore also the weight of the
queue. If the weight is not correctly resumed when the queue is
recycled, then the weight of the recycled queue could differ
from the weight of the original queue.

This commit adds the missing save & resume of the weight.

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Francesco Pollicino <fra.fra.800@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock, bfq: print SHARED instead of pid for shared queues in logs
Francesco Pollicino [Tue, 12 Mar 2019 08:59:33 +0000 (09:59 +0100)]
block, bfq: print SHARED instead of pid for shared queues in logs

The function "bfq_log_bfqq" prints the pid of the process
associated with the queue passed as input.

Unfortunately, if the queue is shared, then more than one process
is associated with the queue. The pid that gets printed in this
case is the pid of one of the associated processes.
Which process gets printed depends on the exact sequence of merge
events the queue underwent. So printing such a pid is rather
useless and above all is often rather confusing because it
reports a random pid between those of the associated processes.

This commit addresses this issue by printing SHARED instead of a pid
if the queue is shared.

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Francesco Pollicino <fra.fra.800@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock, bfq: always protect newly-created queues from existing active queues
Paolo Valente [Tue, 12 Mar 2019 08:59:32 +0000 (09:59 +0100)]
block, bfq: always protect newly-created queues from existing active queues

If many bfq_queues belonging to the same group happen to be created
shortly after each other, then the processes associated with these
queues have typically a common goal. In particular, bursts of queue
creations are usually caused by services or applications that spawn
many parallel threads/processes. Examples are systemd during boot, or
git grep. If there are no other active queues, then, to help these
processes get their job done as soon as possible, the best thing to do
is to reach a high throughput. To this goal, it is usually better to
not grant either weight-raising or device idling to the queues
associated with these processes. And this is exactly what BFQ
currently does.

There is however a drawback: if, in contrast, some other queues are
already active, then the newly created queues must be protected from
the I/O flowing through the already existing queues. In this case, the
best thing to do is the opposite as in the other case: it is much
better to grant weight-raising and device idling to the newly-created
queues, if they deserve it. This commit addresses this issue by doing
so if there are already other active queues.

This change also helps eliminating false positives, which occur when
the newly-created queues do not belong to an actual large burst of
creations, but some background task (e.g., a service) happens to
trigger the creation of new queues in the middle, i.e., very close to
when the victim queues are created. These false positive may cause
total loss of control on process latencies.

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock, bfq: do not tag totally seeky queues as soft rt
Paolo Valente [Tue, 12 Mar 2019 08:59:31 +0000 (09:59 +0100)]
block, bfq: do not tag totally seeky queues as soft rt

Sync random I/O is likely to be confused with soft real-time I/O,
because it is characterized by limited throughput and apparently
isochronous arrival pattern. To avoid false positives, this commits
prevents bfq_queues containing only random (seeky) I/O from being
tagged as soft real-time.

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock, bfq: do not merge queues on flash storage with queueing
Paolo Valente [Tue, 12 Mar 2019 08:59:30 +0000 (09:59 +0100)]
block, bfq: do not merge queues on flash storage with queueing

To boost throughput with a set of processes doing interleaved I/O
(i.e., a set of processes whose individual I/O is random, but whose
merged cumulative I/O is sequential), BFQ merges the queues associated
with these processes, i.e., redirects the I/O of these processes into a
common, shared queue. In the shared queue, I/O requests are ordered by
their position on the medium, thus sequential I/O gets dispatched to
the device when the shared queue is served.

Queue merging costs execution time, because, to detect which queues to
merge, BFQ must maintain a list of the head I/O requests of active
queues, ordered by request positions. Measurements showed that this
costs about 10% of BFQ's total per-request processing time.

Request processing time becomes more and more critical as the speed of
the underlying storage device grows. Yet, fortunately, queue merging
is basically useless on the very devices that are so fast to make
request processing time critical. To reach a high throughput, these
devices must have many requests queued at the same time. But, in this
configuration, the internal scheduling algorithms of these devices do
also the job of queue merging: they reorder requests so as to obtain
as much as possible a sequential I/O pattern. As a consequence, with
processes doing interleaved I/O, the throughput reached by one such
device is likely to be the same, with and without queue merging.

In view of this fact, this commit disables queue merging, and all
related housekeeping, for non-rotational devices with internal
queueing. The total, single-lock-protected, per-request processing
time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
(time measured with simple code instrumentation, and using the
throughput-sync.sh script of the S suite [1], in performance-profiling
mode). To put this result into context, the total,
single-lock-protected, per-request execution time of the lightest I/O
scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is
~800 LOC, against ~10500 LOC for BFQ).

Disabling merging provides a further, remarkable benefit in terms of
throughput. Merging tends to make many workloads artificially more
uneven, mainly because of shared queues remaining non empty for
incomparably more time than normal queues. So, if, e.g., one of the
queues in a set of merged queues has a higher weight than a normal
queue, then the shared queue may inherit such a high weight and, by
staying almost always active, may force BFQ to perform I/O plugging
most of the time. This evidently makes it harder for BFQ to let the
device reach a high throughput.

As a practical example of this problem, and of the benefits of this
commit, we measured again the throughput in the nasty scenario
considered in previous commit messages: dbench test (in the Phoronix
suite), with 6 clients, on a filesystem with journaling, and with the
journaling daemon enjoying a higher weight than normal processes. With
this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a
PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any
of the other I/O schedulers. As such, this is also likely to be the
maximum possible throughput reachable with this workload on this
device, because I/O is mostly random, and the other schedulers
basically just pass I/O requests to the drive as fast as possible.

[1] https://github.com/Algodev-github/S

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Francesco Pollicino <fra.fra.800@gmail.com>
Signed-off-by: Alessio Masola <alessio.masola@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock, bfq: tune service injection basing on request service times
Paolo Valente [Tue, 12 Mar 2019 08:59:29 +0000 (09:59 +0100)]
block, bfq: tune service injection basing on request service times

The processes associated with a bfq_queue, say Q, may happen to
generate their cumulative I/O at a lower rate than the rate at which
the device could serve the same I/O. This is rather probable, e.g., if
only one process is associated with Q and the device is an SSD. It
results in Q becoming often empty while in service. If BFQ is not
allowed to switch to another queue when Q becomes empty, then, during
the service of Q, there will be frequent "service holes", i.e., time
intervals during which Q gets empty and the device can only consume
the I/O already queued in its hardware queues. This easily causes
considerable losses of throughput.

To counter this problem, BFQ implements a request injection mechanism,
which tries to fill the above service holes with I/O requests taken
from other bfq_queues. The hard part in this mechanism is finding the
right amount of I/O to inject, so as to both boost throughput and not
break Q's bandwidth and latency guarantees. To this goal, the current
version of this mechanism measures the bandwidth enjoyed by Q while it
is being served, and tries to inject the maximum possible amount of
extra service that does not cause Q's bandwidth to decrease too
much.

This solution has an important shortcoming. For bandwidth measurements
to be stable and reliable, Q must remain in service for a much longer
time than that needed to serve a single I/O request. Unfortunately,
this does not hold with many workloads. This commit addresses this
issue by changing the way the amount of injection allowed is
dynamically computed. It tunes injection as a function of the service
times of single I/O requests of Q, instead of Q's
bandwidth. Single-request service times are evidently meaningful even
if Q gets very few I/O requests completed while it is in service.

As a testbed for this new solution, we measured the throughput reached
by BFQ for one of the nastiest workloads and configurations for this
scheduler: the workload generated by the dbench test (in the Phoronix
suite), with 6 clients, on a filesystem with journaling, and with the
journaling daemon enjoying a higher weight than normal processes.
With this commit, the throughput grows from ~100 MB/s to ~150 MB/s on
a PLEXTOR PX-256M5.

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Francesco Pollicino <fra.fra.800@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock, bfq: do not idle for lowest-weight queues
Paolo Valente [Tue, 12 Mar 2019 08:59:28 +0000 (09:59 +0100)]
block, bfq: do not idle for lowest-weight queues

In most cases, it is detrimental for throughput to plug I/O dispatch
when the in-service bfq_queue becomes temporarily empty (plugging is
performed to wait for the possible arrival, soon, of new I/O from the
in-service queue). There is however a case where plugging is needed
for service guarantees. If a bfq_queue, say Q, has a higher weight
than some other active bfq_queue, and is sync, i.e., contains sync
I/O, then, to guarantee that Q does receive a higher share of the
throughput than other lower-weight queues, it is necessary to plug I/O
dispatch when Q remains temporarily empty while being served.

For this reason, BFQ performs I/O plugging when some active bfq_queue
has a higher weight than some other active bfq_queue. But this is
unnecessarily overkill. In fact, if the in-service bfq_queue actually
has a weight lower than or equal to the other queues, then the queue
*must not* be guaranteed a higher share of the throughput than the
other queues. So, not plugging I/O cannot cause any harm to the
queue. And can boost throughput.

Taking advantage of this fact, this commit does not plug I/O for sync
bfq_queues with a weight lower than or equal to the weights of the
other queues. Here is an example of the resulting throughput boost
with the dbench workload, which is particularly nasty for BFQ. With
the dbench test in the Phoronix suite, BFQ reaches its lowest total
throughput with 6 clients on a filesystem with journaling, in case the
journaling daemon has a higher weight than normal processes. Before
this commit, the total throughput was ~80 MB/sec on a PLEXTOR PX-256M5,
after this commit it is ~100 MB/sec.

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock, bfq: increase idling for weight-raised queues
Paolo Valente [Tue, 12 Mar 2019 08:59:27 +0000 (09:59 +0100)]
block, bfq: increase idling for weight-raised queues

If a sync bfq_queue has a higher weight than some other queue, and
remains temporarily empty while in service, then, to preserve the
bandwidth share of the queue, it is necessary to plug I/O dispatching
until a new request arrives for the queue. In addition, a timeout
needs to be set, to avoid waiting for ever if the process associated
with the queue has actually finished its I/O.

Even with the above timeout, the device is however not fed with new
I/O for a while, if the process has finished its I/O. If this happens
often, then throughput drops and latencies grow. For this reason, the
timeout is kept rather low: 8 ms is the current default.

Unfortunately, such a low value may cause, on the opposite end, a
violation of bandwidth guarantees for a process that happens to issue
new I/O too late. The higher the system load, the higher the
probability that this happens to some process. This is a problem in
scenarios where service guarantees matter more than throughput. One
important case are weight-raised queues, which need to be granted a
very high fraction of the bandwidth.

To address this issue, this commit lower-bounds the plugging timeout
for weight-raised queues to 20 ms. This simple change provides
relevant benefits. For example, on a PLEXTOR PX-256M5S, with which
gnome-terminal starts in 0.6 seconds if there is no other I/O in
progress, the same applications starts in
- 0.8 seconds, instead of 1.2 seconds, if ten files are being read
  sequentially in parallel
- 1 second, instead of 2 seconds, if, in parallel, five files are
  being read sequentially, and five more files are being written
  sequentially

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock/bfq: fix ifdef for CONFIG_BFQ_GROUP_IOSCHED=y
Konstantin Khlebnikov [Fri, 29 Mar 2019 14:01:18 +0000 (17:01 +0300)]
block/bfq: fix ifdef for CONFIG_BFQ_GROUP_IOSCHED=y

Replace BFQ_GROUP_IOSCHED_ENABLED with CONFIG_BFQ_GROUP_IOSCHED.
Code under these ifdefs never worked, something might be broken.

Fixes: 0471559c2fbd ("block, bfq: add/remove entity weights correctly")
Fixes: 73d58118498b ("block, bfq: consider also ioprio classes in symmetry detection")
Reviewed-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoLinux 5.1-rc3
Linus Torvalds [Sun, 31 Mar 2019 21:39:29 +0000 (14:39 -0700)]
Linux 5.1-rc3

5 years agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Sun, 31 Mar 2019 15:55:59 +0000 (08:55 -0700)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "A collection of x86 and ARM bugfixes, and some improvements to
  documentation.

  On top of this, a cleanup of kvm_para.h headers, which were exported
  by some architectures even though they not support KVM at all. This is
  responsible for all the Kbuild changes in the diffstat"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (28 commits)
  Documentation: kvm: clarify KVM_SET_USER_MEMORY_REGION
  KVM: doc: Document the life cycle of a VM and its resources
  KVM: selftests: complete IO before migrating guest state
  KVM: selftests: disable stack protector for all KVM tests
  KVM: selftests: explicitly disable PIE for tests
  KVM: selftests: assert on exit reason in CR4/cpuid sync test
  KVM: x86: update %rip after emulating IO
  x86/kvm/hyper-v: avoid spurious pending stimer on vCPU init
  kvm/x86: Move MSR_IA32_ARCH_CAPABILITIES to array emulated_msrs
  KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts
  kvm: don't redefine flags as something else
  kvm: mmu: Used range based flushing in slot_handle_level_range
  KVM: export <linux/kvm_para.h> and <asm/kvm_para.h> iif KVM is supported
  KVM: x86: remove check on nr_mmu_pages in kvm_arch_commit_memory_region()
  kvm: nVMX: Add a vmentry check for HOST_SYSENTER_ESP and HOST_SYSENTER_EIP fields
  KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation)
  KVM: Reject device ioctls from processes other than the VM's creator
  KVM: doc: Fix incorrect word ordering regarding supported use of APIs
  KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size'
  KVM: nVMX: Do not inherit quadrant and invalid for the root shadow EPT
  ...

5 years agoMerge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 31 Mar 2019 15:40:15 +0000 (08:40 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
 "A pile of x86 updates:

   - Prevent exceeding he valid physical address space in the /dev/mem
     limit checks.

   - Move all header content inside the header guard to prevent compile
     failures.

   - Fix the bogus __percpu annotation in this_cpu_has() which makes
     sparse very noisy.

   - Disable switch jump tables completely when retpolines are enabled.

   - Prevent leaking the trampoline address"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/realmode: Make set_real_mode_mem() static inline
  x86/cpufeature: Fix __percpu annotation in this_cpu_has()
  x86/mm: Don't exceed the valid physical address space
  x86/retpolines: Disable switch jump tables when retpolines are enabled
  x86/realmode: Don't leak the trampoline kernel address
  x86/boot: Fix incorrect ifdeffery scope
  x86/resctrl: Remove unused variable

5 years agoMerge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 31 Mar 2019 15:37:04 +0000 (08:37 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull perf tooling fixes from Thomas Gleixner:
 "Core libraries:
   - Fix max perf_event_attr.precise_ip detection.
   - Fix parser error for uncore event alias
   - Fixup ordering of kernel maps after obtaining the main kernel map
     address.

  Intel PT:
   - Fix TSC slip where A TSC packet can slip past MTC packets so that
     the timestamp appears to go backwards.
   - Fixes for exported-sql-viewer GUI conversion to python3.

  ARM coresight:
   - Fix the build by adding a missing case value for enumeration value
     introduced in newer library, that now is the required one.

  tool headers:
   - Syncronize kernel headers with the kernel, getting new io_uring and
     pidfd_send_signal syscalls so that 'perf trace' can handle them"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf pmu: Fix parser error for uncore event alias
  perf scripts python: exported-sql-viewer.py: Fix python3 support
  perf scripts python: exported-sql-viewer.py: Fix never-ending loop
  perf machine: Update kernel map address and re-order properly
  tools headers uapi: Sync powerpc's asm/kvm.h copy with the kernel sources
  tools headers: Update x86's syscall_64.tbl and uapi/asm-generic/unistd
  tools headers uapi: Update drm/i915_drm.h
  tools arch x86: Sync asm/cpufeatures.h with the kernel sources
  tools headers uapi: Sync linux/fcntl.h to get the F_SEAL_FUTURE_WRITE addition
  tools headers uapi: Sync asm-generic/mman-common.h and linux/mman.h
  perf evsel: Fix max perf_event_attr.precise_ip detection
  perf intel-pt: Fix TSC slip
  perf cs-etm: Add missing case value

5 years agoMerge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 31 Mar 2019 15:22:12 +0000 (08:22 -0700)]
Merge branch 'smp-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull CPU hotplug fixes from Thomas Gleixner:
 "Two SMT/hotplug related fixes:

   - Prevent crash when HOTPLUG_CPU is disabled and the CPU bringup
     aborts. This is triggered with the 'nosmt' command line option, but
     can happen by any abort condition. As the real unplug code is not
     compiled in, prevent the fail by keeping the CPU in zombie state.

   - Enforce HOTPLUG_CPU for SMP on x86 to avoid the above situation
     completely. With 'nosmt' being a popular option it's required to
     unplug the half brought up sibling CPUs (due to the MCE wreckage)
     completely"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/smp: Enforce CONFIG_HOTPLUG_CPU when SMP=y
  cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=n

5 years agoMerge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 31 Mar 2019 14:48:58 +0000 (07:48 -0700)]
Merge branch 'locking-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull locking fixlet from Thomas Gleixner:
 "Trivial update to the maintainers file"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  MAINTAINERS: Remove deleted file from futex file pattern

5 years agoMerge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 31 Mar 2019 14:47:21 +0000 (07:47 -0700)]
Merge branch 'core-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull core fixes from Thomas Gleixner:
 "A small set of core updates:

   - Make the watchdog respect the selected CPU mask again. That was
     broken by the rework of the watchdog thread management and caused
     inconsistent state and NMI watchdog being unstoppable.

   - Ensure that the objtool build can find the libelf location.

   - Remove dead kcore stub code"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  watchdog: Respect watchdog cpumask on CPU hotplug
  objtool: Query pkg-config for libelf location
  proc/kcore: Remove unused kclist_add_remap()

5 years agoMerge tag 'powerpc-5.1-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc...
Linus Torvalds [Sun, 31 Mar 2019 14:44:13 +0000 (07:44 -0700)]
Merge tag 'powerpc-5.1-4' of git://git./linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "Three non-regression fixes.

   - Our optimised memcmp could read past the end of one of the buffers
     and potentially trigger a page fault leading to an oops.

   - Some of our code to read energy management data on PowerVM had an
     endian bug leading to bogus results.

   - When reporting a machine check exception we incorrectly reported
     TLB multihits as D-Cache multhits due to a missing entry in the
     array of causes.

  Thanks to: Chandan Rajendra, Gautham R. Shenoy, Mahesh Salgaonkar,
  Segher Boessenkool, Vaidyanathan Srinivasan"

* tag 'powerpc-5.1-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/pseries/mce: Fix misleading print for TLB mutlihit
  powerpc/pseries/energy: Use OF accessor functions to read ibm,drc-indexes
  powerpc/64: Fix memcmp reading past the end of src/dest

5 years agoMerge tag 'dmaengine-fix-5.1-rc3' of git://git.infradead.org/users/vkoul/slave-dma
Linus Torvalds [Sun, 31 Mar 2019 14:42:39 +0000 (07:42 -0700)]
Merge tag 'dmaengine-fix-5.1-rc3' of git://git.infradead.org/users/vkoul/slave-dma

Pull dmaengine fixes from Vinod Koul:

 - Revert "dmaengine: stm32-mdma: Add a check on read_u32_array" as that
   caused regression

 - Fix MAINTAINER file uniphier-mdmac.c file path

* tag 'dmaengine-fix-5.1-rc3' of git://git.infradead.org/users/vkoul/slave-dma:
  MAINTAINERS: Fix uniphier-mdmac.c file path
  dmaengine: stm32-mdma: Revert "dmaengine: stm32-mdma: Add a check on read_u32_array"

5 years agoMerge tag 'led-fixes-for-5.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 30 Mar 2019 19:12:56 +0000 (12:12 -0700)]
Merge tag 'led-fixes-for-5.1-rc3' of git://git./linux/kernel/git/j.anaszewski/linux-leds

Pull LED fixes from Jacek Anaszewski:

 - fix refcnt leak on interface rename

 - use memcpy in device_name_store() to avoid including garbage from a
   previous, longer value in the device_name

 - fix a potential NULL pointer dereference in case of_match_device()
   cannot find a match

* tag 'led-fixes-for-5.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds:
  leds: trigger: netdev: use memcpy in device_name_store
  leds: pca9532: fix a potential NULL pointer dereference
  leds: trigger: netdev: fix refcnt leak on interface rename

5 years agoMerge tag 'gpio-v5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux...
Linus Torvalds [Sat, 30 Mar 2019 18:33:34 +0000 (11:33 -0700)]
Merge tag 'gpio-v5.1-2' of git://git./linux/kernel/git/linusw/linux-gpio

Pull GPIO fixes from Linus Walleij:
 "As you can see [in the git history] I was away on leave and Bartosz
  kindly stepped in and collected a slew of fixes, I pulled them into my
  tree in two sets and merged some two more fixes (fixing my own caused
  bugs) on top.

  Summary:

   - Revert the extended use of gpio_set_config() and think about how we
     can do this properly.

   - Fix up the SPI CS GPIO handling so it now works properly on the SPI
     bus children, as intended.

   - Error paths and driver fixes"

* tag 'gpio-v5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
  gpio: mockup: use simple_read_from_buffer() in debugfs read callback
  gpio: of: Fix of_gpiochip_add() error path
  gpio: of: Check for "spi-cs-high" in child instead of parent node
  gpio: of: Check propname before applying "cs-gpios" quirks
  gpio: mockup: fix debugfs read
  Revert "gpio: use new gpio_set_config() helper in more places"
  gpio: aspeed: fix a potential NULL pointer dereference
  gpio: amd-fch: Fix bogus SPDX identifier
  gpio: adnp: Fix testing wrong value in adnp_gpio_direction_input
  gpio: exar: add a check for the return value of ida_simple_get fails

5 years agoleds: trigger: netdev: use memcpy in device_name_store
Rasmus Villemoes [Thu, 14 Mar 2019 14:06:14 +0000 (15:06 +0100)]
leds: trigger: netdev: use memcpy in device_name_store

If userspace doesn't end the input with a newline (which can easily
happen if the write happens from a C program that does write(fd,
iface, strlen(iface))), we may end up including garbage from a
previous, longer value in the device_name. For example

# cat device_name

# printf 'eth12' > device_name
# cat device_name
eth12
# printf 'eth3' > device_name
# cat device_name
eth32

I highly doubt anybody is relying on this behaviour, so switch to
simply copying the bytes (we've already checked that size is <
IFNAMSIZ) and unconditionally zero-terminate it; of course, we also
still have to strip a trailing newline.

This is also preparation for future patches.

Fixes: 06f502f57d0d ("leds: trigger: Introduce a NETDEV trigger")
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jacek Anaszewski <jacek.anaszewski@gmail.com>