Christoph Hellwig [Mon, 18 Feb 2019 08:37:42 +0000 (09:37 +0100)]
nvme-tcp.h: fix SPDX header
For .h files we need to use /* */ style comments.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Christoph Hellwig [Mon, 18 Feb 2019 08:37:13 +0000 (09:37 +0100)]
nvme_ioctl.h: remove duplicate GPL boilerplate
We already have a ЅPDX header, so no need to duplicate the information.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Hannes Reinecke [Tue, 19 Feb 2019 12:13:57 +0000 (13:13 +0100)]
nvme: return error from nvme_alloc_ns()
nvme_alloc_ns() might fail, so we should be returning an error code.
Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Bart Van Assche [Thu, 14 Feb 2019 22:50:57 +0000 (14:50 -0800)]
nvme: avoid that deleting a controller triggers a circular locking complaint
Rework nvme_delete_ctrl_sync() such that it does not have to wait for
queued work. This patch avoids that test nvme/008 triggers the following
complaint:
WARNING: possible circular locking dependency detected
5.0.0-rc6-dbg+ #10 Not tainted
------------------------------------------------------
nvme/7918 is trying to acquire lock:
000000009a1a7b69 ((work_completion)(&ctrl->delete_work)){+.+.}, at: __flush_work+0x379/0x410
but task is already holding lock:
00000000ef5a45b4 (kn->count#389){++++}, at: kernfs_remove_self+0x196/0x210
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (kn->count#389){++++}:
lock_acquire+0xc5/0x1e0
__kernfs_remove+0x42a/0x4a0
kernfs_remove_by_name_ns+0x45/0x90
remove_files.isra.1+0x3a/0x90
sysfs_remove_group+0x5c/0xc0
sysfs_remove_groups+0x39/0x60
device_remove_attrs+0x68/0xb0
device_del+0x24d/0x570
cdev_device_del+0x1a/0x50
nvme_delete_ctrl_work+0xbd/0xe0
process_one_work+0x4f1/0xa40
worker_thread+0x67/0x5b0
kthread+0x1cf/0x1f0
ret_from_fork+0x24/0x30
-> #0 ((work_completion)(&ctrl->delete_work)){+.+.}:
__lock_acquire+0x1323/0x17b0
lock_acquire+0xc5/0x1e0
__flush_work+0x399/0x410
flush_work+0x10/0x20
nvme_delete_ctrl_sync+0x65/0x70
nvme_sysfs_delete+0x4f/0x60
dev_attr_store+0x3e/0x50
sysfs_kf_write+0x87/0xa0
kernfs_fop_write+0x186/0x240
__vfs_write+0xd7/0x430
vfs_write+0xfa/0x260
ksys_write+0xab/0x130
__x64_sys_write+0x43/0x50
do_syscall_64+0x71/0x210
entry_SYSCALL_64_after_hwframe+0x49/0xbe
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(kn->count#389);
lock((work_completion)(&ctrl->delete_work));
lock(kn->count#389);
lock((work_completion)(&ctrl->delete_work));
*** DEADLOCK ***
3 locks held by nvme/7918:
#0:
00000000e2223b44 (sb_writers#6){.+.+}, at: vfs_write+0x1eb/0x260
#1:
000000003404976f (&of->mutex){+.+.}, at: kernfs_fop_write+0x128/0x240
#2:
00000000ef5a45b4 (kn->count#389){++++}, at: kernfs_remove_self+0x196/0x210
stack backtrace:
CPU: 4 PID: 7918 Comm: nvme Not tainted 5.0.0-rc6-dbg+ #10
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
dump_stack+0x86/0xca
print_circular_bug.isra.36.cold.54+0x173/0x1d5
check_prev_add.constprop.45+0x996/0x1110
__lock_acquire+0x1323/0x17b0
lock_acquire+0xc5/0x1e0
__flush_work+0x399/0x410
flush_work+0x10/0x20
nvme_delete_ctrl_sync+0x65/0x70
nvme_sysfs_delete+0x4f/0x60
dev_attr_store+0x3e/0x50
sysfs_kf_write+0x87/0xa0
kernfs_fop_write+0x186/0x240
__vfs_write+0xd7/0x430
vfs_write+0xfa/0x260
ksys_write+0xab/0x130
__x64_sys_write+0x43/0x50
do_syscall_64+0x71/0x210
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Bart Van Assche [Thu, 14 Feb 2019 22:50:56 +0000 (14:50 -0800)]
nvme: introduce a helper function for controller deletion
This patch does not change any functionality but makes the next patch
in this series easier to read.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Bart Van Assche [Thu, 14 Feb 2019 22:50:55 +0000 (14:50 -0800)]
nvme: unexport nvme_delete_ctrl_sync()
Since nvme_delete_ctrl_sync() is not called from any other kernel module,
unexport it.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Bart Van Assche [Thu, 14 Feb 2019 22:50:54 +0000 (14:50 -0800)]
nvme-pci: check kstrtoint() return value in queue_count_set()
This patch avoids that the compiler complains about 'ret' being set
but not being used when building with W=1.
Fixes:
3b6592f70ad7 ("nvme: utilize two queue maps, one for reads and one for writes") # v5.0-rc1
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Bart Van Assche [Thu, 14 Feb 2019 22:50:53 +0000 (14:50 -0800)]
nvme-fabrics: document the poll function argument
This patch avoids that the kernel-doc tool reports a warning when
building with W=1.
Fixes:
26c682274e0a ("nvme-fabrics: allow nvmf_connect_io_queue to poll") # v5.0-rc1
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Bart Van Assche [Thu, 14 Feb 2019 22:50:52 +0000 (14:50 -0800)]
nvmet: fix indentation
This patch avoids that smatch complains about inconsistent indentation.
Fixes:
a07b4970f464 ("nvmet: add a generic NVMe target") # v4.10
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Mon, 18 Feb 2019 10:43:26 +0000 (11:43 +0100)]
nvme-multipath: round-robin I/O policy
Implement a simple round-robin I/O policy for multipathing. Path
selection is done in two rounds, first iterating across all optimized
paths, and if that doesn't return any valid paths, iterate over all
optimized and non-optimized paths. If no paths are found, use the
existing algorithm. Also add a sysfs attribute 'iopolicy' to switch
between the current NUMA-aware I/O policy and the 'round-robin' I/O
policy.
Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Ming Lei [Tue, 19 Feb 2019 03:08:19 +0000 (11:08 +0800)]
block: avoid to READ fields of null bio
rq->bio can be NULL sometimes, such as flush request, so don't
read bio->bi_seg_front_size until this 'bio' is checked as valid.
Cc: Bart Van Assche <bvanassche@acm.org>
Reported-by: Bart Van Assche <bvanassche@acm.org>
Fixes:
dcebd755926b0f39dd1e ("block: use bio_for_each_bvec() to compute multi-page bvec count")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 15 Feb 2019 15:43:59 +0000 (08:43 -0700)]
Merge tag 'v5.0-rc6' into for-5.1/block
Pull in 5.0-rc6 to avoid a dumb merge conflict with fs/iomap.c.
This is needed since io_uring is now based on the block branch,
to avoid a conflict between the multi-page bvecs and the bits
of io_uring that touch the core block parts.
* tag 'v5.0-rc6': (525 commits)
Linux 5.0-rc6
x86/mm: Make set_pmd_at() paravirt aware
MAINTAINERS: Update the ocores i2c bus driver maintainer, etc
blk-mq: remove duplicated definition of blk_mq_freeze_queue
Blk-iolatency: warn on negative inflight IO counter
blk-iolatency: fix IO hang due to negative inflight counter
MAINTAINERS: unify reference to xen-devel list
x86/mm/cpa: Fix set_mce_nospec()
futex: Handle early deadlock return correctly
futex: Fix barrier comment
net: dsa: b53: Fix for failure when irq is not defined in dt
blktrace: Show requests without sector
mips: cm: reprime error cause
mips: loongson64: remove unreachable(), fix loongson_poweroff().
sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
geneve: should not call rt6_lookup() when ipv6 was disabled
KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222)
kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)
signal: Better detection of synchronous signals
...
Ming Lei [Fri, 15 Feb 2019 11:13:24 +0000 (19:13 +0800)]
block: kill BLK_MQ_F_SG_MERGE
QUEUE_FLAG_NO_SG_MERGE has been killed, so kill BLK_MQ_F_SG_MERGE too.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Feb 2019 11:13:23 +0000 (19:13 +0800)]
block: kill QUEUE_FLAG_NO_SG_MERGE
Since
bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting"),
physical segment number is mainly figured out in blk_queue_split() for
fast path, and the flag of BIO_SEG_VALID is set there too.
Now only blk_recount_segments() and blk_recalc_rq_segments() use this
flag.
Basically blk_recount_segments() is bypassed in fast path given BIO_SEG_VALID
is set in blk_queue_split().
For another user of blk_recalc_rq_segments():
- run in partial completion branch of blk_update_request, which is an unusual case
- run in blk_cloned_rq_check_limits(), still not a big problem if the flag is killed
since dm-rq is the only user.
Multi-page bvec is enabled now, not doing S/G merging is rather pointless with the
current setup of the I/O path, as it isn't going to save you a significant amount
of cycles.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Feb 2019 11:13:22 +0000 (19:13 +0800)]
block: document usage of bio iterator helpers
Now multi-page bvec is supported, some helpers may return page by
page, meantime some may return segment by segment, this patch
documents the usage.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Feb 2019 11:13:21 +0000 (19:13 +0800)]
block: always define BIO_MAX_PAGES as 256
Now multi-page bvec can cover CONFIG_THP_SWAP, so we don't need to
increase BIO_MAX_PAGES for it.
CONFIG_THP_SWAP needs to split one THP into normal pages and adds
them all to one bio. With multipage-bvec, it just takes one bvec to
hold them all.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Feb 2019 11:13:20 +0000 (19:13 +0800)]
block: enable multipage bvecs
This patch pulls the trigger for multi-page bvecs.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Feb 2019 11:13:19 +0000 (19:13 +0800)]
block: allow bio_for_each_segment_all() to iterate over multi-page bvec
This patch introduces one extra iterator variable to bio_for_each_segment_all(),
then we can allow bio_for_each_segment_all() to iterate over multi-page bvec.
Given it is just one mechannical & simple change on all bio_for_each_segment_all()
users, this patch does tree-wide change in one single patch, so that we can
avoid to use a temporary helper for this conversion.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Feb 2019 11:13:18 +0000 (19:13 +0800)]
bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()
bch_bio_alloc_pages() is always called on one new bio, so it is safe
to access the bvec table directly. Given it is the only kind of this
case, open code the bvec table access since bio_for_each_segment_all()
will be changed to support for iterating over multipage bvec.
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Feb 2019 11:13:17 +0000 (19:13 +0800)]
block: loop: pass multi-page bvec to iov_iter
iov_iter is implemented on bvec itererator helpers, so it is safe to pass
multi-page bvec to it, and this way is much more efficient than passing one
page in each bvec.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Feb 2019 11:13:16 +0000 (19:13 +0800)]
btrfs: use mp_bvec_last_segment to get bio's last page
Preparing for supporting multi-page bvec.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Feb 2019 11:13:15 +0000 (19:13 +0800)]
fs/buffer.c: use bvec iterator to truncate the bio
Once multi-page bvec is enabled, the last bvec may include more than one
page, this patch use mp_bvec_last_segment() to truncate the bio.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Feb 2019 11:13:14 +0000 (19:13 +0800)]
block: introduce mp_bvec_last_segment()
BTRFS and guard_bio_eod() need to get the last singlepage segment
from one multipage bvec, so introduce this helper to make them happy.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Feb 2019 11:13:13 +0000 (19:13 +0800)]
block: use bio_for_each_bvec() to map sg
It is more efficient to use bio_for_each_bvec() to map sg, meantime
we have to consider splitting multipage bvec as done in blk_bio_segment_split().
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Feb 2019 11:13:12 +0000 (19:13 +0800)]
block: use bio_for_each_bvec() to compute multi-page bvec count
First it is more efficient to use bio_for_each_bvec() in both
blk_bio_segment_split() and __blk_recalc_rq_segments() to compute how
many multi-page bvecs there are in the bio.
Secondly once bio_for_each_bvec() is used, the bvec may need to be
splitted because its length can be very longer than max segment size,
so we have to split the big bvec into several segments.
Thirdly when splitting multi-page bvec into segments, the max segment
limit may be reached, so the bio split need to be considered under
this situation too.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Feb 2019 11:13:11 +0000 (19:13 +0800)]
block: introduce bio_for_each_bvec() and rq_for_each_bvec()
bio_for_each_bvec() is used for iterating over multi-page bvec for bio
split & merge code.
rq_for_each_bvec() can be used for drivers which may handle the
multi-page bvec directly, so far loop is one perfect use case.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Feb 2019 11:13:10 +0000 (19:13 +0800)]
block: introduce multi-page bvec helpers
This patch introduces helpers of 'mp_bvec_iter_*' for multi-page bvec
support.
The introduced helpers treate one bvec as real multi-page segment,
which may include more than one pages.
The existed helpers of bvec_iter_* are interfaces for supporting current
bvec iterator which is thought as single-page by drivers, fs, dm and
etc. These introduced helpers will build single-page bvec in flight, so
this way won't break current bio/bvec users, which needn't any change.
Follows some multi-page bvec background:
- bvecs stored in bio->bi_io_vec is always multi-page style
- bvec(struct bio_vec) represents one physically contiguous I/O
buffer, now the buffer may include more than one page after
multi-page bvec is supported, and all these pages represented
by one bvec is physically contiguous. Before multi-page bvec
support, at most one page is included in one bvec, we call it
single-page bvec.
- .bv_page of the bvec points to the 1st page in the multi-page bvec
- .bv_offset of the bvec is the offset of the buffer in the bvec
The effect on the current drivers/filesystem/dm/bcache/...:
- almost everyone supposes that one bvec only includes one single
page, so we keep the sp interface not changed, for example,
bio_for_each_segment() still returns single-page bvec
- bio_for_each_segment_all() will return single-page bvec too
- during iterating, iterator variable(struct bvec_iter) is always
updated in multi-page bvec style, and bvec_iter_advance() is kept
not changed
- returned(copied) single-page bvec is built in flight by bvec
helpers from the stored multi-page bvec
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Feb 2019 11:13:09 +0000 (19:13 +0800)]
block: remove bvec_iter_rewind()
Commit
7759eb23fd980 ("block: remove bio_rewind_iter()") removes
bio_rewind_iter(), then no one uses bvec_iter_rewind() any more,
so remove it.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Feb 2019 11:13:08 +0000 (19:13 +0800)]
block: don't use bio->bi_vcnt to figure out segment number
It is wrong to use bio->bi_vcnt to figure out how many segments
there are in the bio even though CLONED flag isn't set on this bio,
because this bio may be splitted or advanced.
So always use bio_segments() in blk_recount_segments(), and it shouldn't
cause any performance loss now because the physical segment number is figured
out in blk_queue_split() and BIO_SEG_VALID is set meantime since
bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting").
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes:
76d8137a3113 ("blk-merge: recaculate segment if it isn't less than max segments")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 15 Feb 2019 11:13:07 +0000 (19:13 +0800)]
btrfs: look at bi_size for repair decisions
bio_readpage_error currently uses bi_vcnt to decide if it is worth
retrying an I/O. But the vector count is mostly an implementation
artifact - it really should figure out if there is more than a
single sector worth retrying. Use bi_size for that and shift by
PAGE_SHIFT. This really should be blocks/sectors, but given that
btrfs doesn't support a sector size different from the PAGE_SIZE
using the page size keeps the changes to a minimum.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Aleksei Zakharov [Mon, 11 Feb 2019 10:50:37 +0000 (13:50 +0300)]
block: avoid setting none scheduler if it's already none
There's no reason to freeze queue and remove scheduler
if there's no scheduler already.
Signed-off-by: Aleksei Zakharov <zakharov.a.g@yandex.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Aleksei Zakharov [Mon, 11 Feb 2019 10:10:34 +0000 (13:10 +0300)]
block: avoid setting wbt_lat_usec to current value
There's no reason to set wbt min lat and freeze request queue
if current value is the same.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Aleksei Zakharov <zakharov.a.g@yandex.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Heiner Litz [Mon, 11 Feb 2019 12:25:09 +0000 (13:25 +0100)]
lightnvm: pblk: fix race condition on GC
This patch fixes a race condition where a write is mapped to the last
sectors of a line. The write is synced to the device but the L2P is not
updated yet. When the line is garbage collected before the L2P update
is performed, the sectors are ignored by the GC logic and the line is
freed before all sectors are moved. When the L2P is finally updated, it
contains a mapping to a freed line, subsequent reads of the
corresponding LBAs fail.
This patch introduces a per line counter specifying the number of
sectors that are synced to the device but have not been updated in the
L2P. Lines with a counter of greater than zero will not be selected
for GC.
Signed-off-by: Heiner Litz <hlitz@ucsc.edu>
Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Mon, 11 Feb 2019 12:25:08 +0000 (13:25 +0100)]
lightnvm: pblk: prevent stall due to wb threshold
In order to respect mw_cuinits, pblk's write buffer maintains a
backpointer to protect data not yet persisted; when writing to the write
buffer, this backpointer defines a threshold that pblk's rate-limiter
enforces.
On small PU configurations, the following scenarios might take place: (i)
the threshold is larger than the write buffer and (ii) the threshold is
smaller than the write buffer, but larger than the maximun allowed
split bio - 256KB at this moment (Note that writes are not always
split - we only do this when we the size of the buffer is smaller
than the buffer). In both cases, pblk's rate-limiter prevents the I/O to
be written to the buffer, thus stalling.
This patch fixes the original backpointer implementation by considering
the threshold both on buffer creation and on the rate-limiters path,
when bio_split is triggered (case (ii) above).
Fixes:
766c8ceb16fc ("lightnvm: pblk: guarantee that backpointer is respected on writer stall")
Signed-off-by: Javier González <javier@javigon.com>
Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hans Holmberg [Mon, 11 Feb 2019 12:25:07 +0000 (13:25 +0100)]
lightnvm: pblk: extend line wp balance check
pblk stripes writes of minimal write size across all non-offline chunks
in a line, which means that the maximum write pointer delta should not
exceed the minimal write size.
Extend the line write pointer balance check to cover this case, and
ignore the offline chunk wps.
This will render us a warning during recovery if something unexpected
has happened to the chunk write pointers (i.e. powerloss, a spurious
chunk reset, ..).
Reported-by: Zhoujie Wu <zjwu@marvell.com>
Tested-by: Zhoujie Wu <zjwu@marvell.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Masahiro Yamada [Mon, 11 Feb 2019 12:25:06 +0000 (13:25 +0100)]
lightnvm: pblk: fix TRACE_INCLUDE_PATH
As the comment block in include/trace/define_trace.h says,
TRACE_INCLUDE_PATH should be a relative path to the define_trace.h
../../drivers/lightnvm is the correct relative path.
../../../drivers/lightnvm is working by coincidence because the top
Makefile adds -I$(srctree)/arch/$(SRCARCH)/include as a header
search path, but we should not rely on it.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Andy Shevchenko [Mon, 11 Feb 2019 12:25:05 +0000 (13:25 +0100)]
lightnvm: pblk: Switch to use new generic UUID API
There are new types and helpers that are supposed to be used in new code.
As a preparation to get rid of legacy types and API functions do
the conversion here.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Andy Shevchenko [Mon, 11 Feb 2019 12:25:04 +0000 (13:25 +0100)]
lightnvm: Use u64 instead of __le64 for CPU visible side
Sparse complains about using strict data types:
drivers/lightnvm/pblk-read.c:254:43: warning: incorrect type in assignment (different base types)
drivers/lightnvm/pblk-read.c:254:43: expected restricted __le64 <noident>
drivers/lightnvm/pblk-read.c:254:43: got unsigned long long [unsigned] [usertype] <noident>
drivers/lightnvm/pblk-read.c:255:29: warning: cast from restricted __le64
drivers/lightnvm/pblk-read.c:268:29: warning: cast from restricted __le64
drivers/lightnvm/pblk-read.c:328:41: warning: incorrect type in assignment (different base types)
drivers/lightnvm/pblk-read.c:328:41: expected restricted __le64 <noident>
drivers/lightnvm/pblk-read.c:328:41: got unsigned long long [unsigned] [usertype] <noident>
In the code it seems explicit that lba_list_mem and lba_list_media members
of struct pblk_pr_ctx are used on CPU side, which means they should not be
of strict types.
Change types of lba_list_mem and lba_list_media members to be u64.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hans Holmberg [Mon, 11 Feb 2019 12:25:03 +0000 (13:25 +0100)]
lightnvm: pblk: use vfree to free metadata on error path
As chunk metadata is allocated using vmalloc, we need to free it
using vfree.
Fixes:
090ee26fd512 ("lightnvm: use internal allocation for chunk log page")
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hans Holmberg [Mon, 11 Feb 2019 12:25:02 +0000 (13:25 +0100)]
lightnvm: pblk: stop taking the free lock in in pblk_lines_free
pblk_line_meta_free might sleep (it can end up calling vfree, depending
on how we allocate lba lists), and this can lead to a BUG()
if we wake up on a different cpu and release the lock.
As there is no point of grabbing the free lock when pblk has shut down,
remove the lock.
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Sun, 10 Feb 2019 22:42:20 +0000 (14:42 -0800)]
Linux 5.0-rc6
Linus Torvalds [Sun, 10 Feb 2019 18:39:37 +0000 (10:39 -0800)]
Merge tag 'dmaengine-fix-5.0-rc6' of git://git.infradead.org/users/vkoul/slave-dma
Pull dmaengine fixes from Vinod Koul:
- Fix in at_xdmac fr wrongful channel state
- Fix for imx driver for wrong callback invocation
- Fix to bcm driver for interrupt race & transaction abort.
- Fix in dmatest to abort in mapping error
* tag 'dmaengine-fix-5.0-rc6' of git://git.infradead.org/users/vkoul/slave-dma:
dmaengine: dmatest: Abort test in case of mapping error
dmaengine: bcm2835: Fix abort of transactions
dmaengine: bcm2835: Fix interrupt race on RT
dmaengine: imx-dma: fix wrong callback invoke
dmaengine: at_xdmac: Fix wrongfull report of a channel as in use
Linus Torvalds [Sun, 10 Feb 2019 17:57:42 +0000 (09:57 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"A handful of fixes:
- Fix an MCE corner case bug/crash found via MCE injection testing
- Fix 5-level paging boot crash
- Fix MCE recovery cache invalidation bug
- Fix regression on Xen guests caused by a recent PMD level mremap
speedup optimization"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Make set_pmd_at() paravirt aware
x86/mm/cpa: Fix set_mce_nospec()
x86/boot/compressed/64: Do not corrupt EDX on EFER.LME=1 setting
x86/MCE: Initialize mce.bank in the case of a fatal error in mce_no_way_out()
Linus Torvalds [Sun, 10 Feb 2019 17:54:19 +0000 (09:54 -0800)]
Merge branch 'irq-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull irq fixes from Ingo Molnar:
"irqchip driver fixes: most of them are race fixes for ARM GIC (General
Interrupt Controller) variants, but also a fix for the ARM MMP
(Marvell PXA168 et al) irqchip affecting OLPC keyboards"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3-its: Fix ITT_entry_size accessor
irqchip/mmp: Only touch the PJ4 IRQ & FIQ bits on enable/disable
irqchip/gic-v3-its: Gracefully fail on LPI exhaustion
irqchip/gic-v3-its: Plug allocation race for devices sharing a DevID
irqchip/gic-v4: Fix occasional VLPI drop
Linus Torvalds [Sun, 10 Feb 2019 17:48:18 +0000 (09:48 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"A couple of kernel side fixes:
- Fix the Intel uncore driver on certain hardware configurations
- Fix a CPU hotplug related memory allocation bug
- Remove a spurious WARN()
... plus also a handful of perf tooling fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf script python: Add Python3 support to tests/attr.py
perf trace: Support multiple "vfs_getname" probes
perf symbols: Filter out hidden symbols from labels
perf symbols: Add fallback definitions for GELF_ST_VISIBILITY()
tools headers uapi: Sync linux/in.h copy from the kernel sources
perf clang: Do not use 'return std::move(something)'
perf mem/c2c: Fix perf_mem_events to support powerpc
perf tests evsel-tp-sched: Fix bitwise operator
perf/core: Don't WARN() for impossible ring-buffer sizes
perf/x86/intel: Delay memory deallocation until x86_pmu_dead_cpu()
perf/x86/intel/uncore: Add Node ID mask
Linus Torvalds [Sun, 10 Feb 2019 17:44:52 +0000 (09:44 -0800)]
Merge branch 'locking-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
"An rtmutex (PI-futex) deadlock scenario fix, plus a locking
documentation fix"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Handle early deadlock return correctly
futex: Fix barrier comment
Marcos Paulo de Souza [Sun, 10 Feb 2019 17:22:51 +0000 (15:22 -0200)]
blk-sysfs: Rework documention of __blk_release_queue
The Notes section of the comment was removed, because now
blk_release_queue can only be executed from blk_cleanup_queue (being
called when the q->kobj reaches zero), and also blk_init_queue was removed
in
a1ce35fa4985.
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Marcos Paulo de Souza [Fri, 25 Jan 2019 02:01:42 +0000 (00:01 -0200)]
blk-cgroup: Fix doc related to blkcg_exit_queue
Since
4cf6324b17e9, a portion of function blk_cleanup_queue was moved to
a newly created function called blk_exit_queue, including the call of
blkcg_exit_queue. So, adjust the documenation according.
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Juergen Gross [Sun, 10 Feb 2019 07:40:56 +0000 (08:40 +0100)]
x86/mm: Make set_pmd_at() paravirt aware
set_pmd_at() calls native_set_pmd() unconditionally on x86. This was
fine as long as only huge page entries were written via set_pmd_at(),
as Xen pv guests don't support those.
Commit
2c91bd4a4e2e53 ("mm: speed up mremap by 20x on large regions")
introduced a usage of set_pmd_at() possible on pv guests, leading to
failures like:
BUG: unable to handle kernel paging request at
ffff888023e26778
#PF error: [PROT] [WRITE]
RIP: e030:move_page_tables+0x7c1/0xae0
move_vma.isra.3+0xd1/0x2d0
__se_sys_mremap+0x3c6/0x5b0
do_syscall_64+0x49/0x100
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Make set_pmd_at() paravirt aware by just letting it use set_pmd().
Fixes:
2c91bd4a4e2e53 ("mm: speed up mremap by 20x on large regions")
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: xen-devel@lists.xenproject.org
Cc: boris.ostrovsky@oracle.com
Cc: sstabellini@kernel.org
Cc: hpa@zytor.com
Cc: bp@alien8.de
Cc: torvalds@linux-foundation.org
Link: https://lkml.kernel.org/r/20190210074056.11842-1-jgross@suse.com
Jens Axboe [Sat, 9 Feb 2019 22:42:07 +0000 (15:42 -0700)]
block: queue flag cleanup
We have QUEUE_FLAG_DEFAULT defined, but it's not used anymore since
the legacy IO stack is gone. Kill it.
Sanitize the queue flags in general, they use spaces (for some
reason), and the space is pretty sparse. With the flags renumbered,
we can more clearly see how many we have available.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 9 Feb 2019 22:40:24 +0000 (15:40 -0700)]
block: kill QUEUE_FLAG_FLUSH_NQ
We have various helpers for setting/clearing this flag, and also
a helper to check if the queue supports queueable flushes or not.
But nobody uses them anymore, kill it with fire.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Sat, 9 Feb 2019 21:43:12 +0000 (13:43 -0800)]
Merge branch 'i2c/for-current' of git://git./linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"One PM related driver bugfix and a MAINTAINERS update"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
MAINTAINERS: Update the ocores i2c bus driver maintainer, etc
i2c: omap: Use noirq system sleep pm ops to idle device for suspend
Linus Torvalds [Sat, 9 Feb 2019 20:41:14 +0000 (12:41 -0800)]
Merge tag 'mips_fixes_5.0_3' of git://git./linux/kernel/git/mips/linux
Pull MIPS fixes from Paul Burton:
"A batch of MIPS fixes for 5.0, nothing too scary.
- A workaround for a Loongson 3 CPU bug is the biggest change, but
still fairly straightforward. It adds extra memory barriers (sync
instructions) around atomics to avoid a CPU bug that can break
atomicity.
- Loongson64 also sees a fix for powering off some systems which
would incorrectly reboot rather than waiting for the power down
sequence to complete.
- We have DT fixes for the Ingenic JZ4740 SoC & the JZ4780-based Ci20
board, and a DT warning fix for the Nexsys4/MIPSfpga board.
- The Cavium Octeon platform sees a further fix to the behaviour of
the pcie_disable command line argument that was introduced in v3.3.
- The VDSO, introduced in v4.4, sees build fixes for configurations
of GCC that were built using the --with-fp-32= flag to specify a
default 32-bit floating point ABI.
- get_frame_info() sees a fix for configurations with
CONFIG_KALLSYMS=n, for which it previously always returned an
error.
- If the MIPS Coherence Manager (CM) reports an error then we'll now
clear that error correctly so that the GCR_ERROR_CAUSE register
will be updated with information about any future errors"
* tag 'mips_fixes_5.0_3' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
mips: cm: reprime error cause
mips: loongson64: remove unreachable(), fix loongson_poweroff().
MIPS: Remove function size check in get_frame_info()
MIPS: Use lower case for addresses in nexys4ddr.dts
MIPS: Loongson: Introduce and use loongson_llsc_mb()
MIPS: VDSO: Include $(ccflags-vdso) in o32,n32 .lds builds
MIPS: VDSO: Use same -m%-float cflag as the kernel proper
MIPS: OCTEON: don't set octeon_dma_bar_type if PCI is disabled
DTS: CI20: Fix bugs in ci20's device tree.
MIPS: DTS: jz4740: Correct interrupt number of DMA core
Linus Torvalds [Sat, 9 Feb 2019 18:26:09 +0000 (10:26 -0800)]
Merge tag 'for-linus-
20190209' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- NVMe pull request from Christoph, fixing namespace locking when
dealing with the effects log, and a rapid add/remove issue (Keith)
- blktrace tweak, ensuring requests with -1 sectors are shown (Jan)
- link power management quirk for a Smasung SSD (Hans)
- m68k nfblock dynamic major number fix (Chengguang)
- series fixing blk-iolatency inflight counter issue (Liu)
- ensure that we clear ->private when setting up the aio kiocb (Mike)
- __find_get_block_slow() rate limit print (Tetsuo)
* tag 'for-linus-
20190209' of git://git.kernel.dk/linux-block:
blk-mq: remove duplicated definition of blk_mq_freeze_queue
Blk-iolatency: warn on negative inflight IO counter
blk-iolatency: fix IO hang due to negative inflight counter
blktrace: Show requests without sector
fs: ratelimit __find_get_block_slow() failure message.
m68k: set proper major_num when specifying module param major_num
libata: Add NOLPM quirk for SAMSUNG MZ7TE512HMHP-000L1 SSD
nvme-pci: fix rapid add remove sequence
nvme: lock NS list changes while handling command effects
aio: initialize kiocb private in case any filesystems expect it.
Linus Torvalds [Sat, 9 Feb 2019 18:17:01 +0000 (10:17 -0800)]
Merge tag 'mtd/fixes-for-5.0-rc6' of git://git.infradead.org/linux-mtd
Pull mtd fixes from Boris Brezillon:
- Fix a problem with the imx28 ECC engine
- Remove a debug trace introduced in
2b6f0090a333 ("mtd: Check
add_mtd_device() ret code")
- Make sure partitions of size 0 can be registered
- Fix kernel-doc warning in the rawnand core
- Fix the error path of spinand_init() (missing manufacturer cleanup in
a few places)
- Address a problem with the SPI NAND PROGRAM LOAD operation which does
not work as expected on some parts.
* tag 'mtd/fixes-for-5.0-rc6' of git://git.infradead.org/linux-mtd:
mtd: rawnand: gpmi: fix MX28 bus master lockup problem
mtd: Make sure mtd->erasesize is valid even if the partition is of size 0
mtd: Remove a debug trace in mtdpart.c
mtd: rawnand: fix kernel-doc warnings
mtd: spinand: Fix the error/cleanup path in spinand_init()
mtd: spinand: Handle the case where PROGRAM LOAD does not reset the cache
Linus Torvalds [Sat, 9 Feb 2019 17:44:08 +0000 (09:44 -0800)]
Merge tag 'for-linus-5.0-rc6-tag' of git://git./linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"Two very minor fixes: one remove of a #include for an unused header
and a fix of the xen ML address in MAINTAINERS"
* tag 'for-linus-5.0-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
MAINTAINERS: unify reference to xen-devel list
arch/arm/xen: Remove duplicate header
Coly Li [Sat, 9 Feb 2019 04:53:11 +0000 (12:53 +0800)]
bcache: use (REQ_META|REQ_PRIO) to indicate bio for metadata
In 'commit
752f66a75aba ("bcache: use REQ_PRIO to indicate bio for
metadata")' REQ_META is replaced by REQ_PRIO to indicate metadata bio.
This assumption is not always correct, e.g. XFS uses REQ_META to mark
metadata bio other than REQ_PRIO. This is why Nix noticed that bcache
does not cache metadata for XFS after the above commit.
Thanks to Dave Chinner, he explains the difference between REQ_META and
REQ_PRIO from view of file system developer. Here I quote part of his
explanation from mailing list,
REQ_META is used for metadata. REQ_PRIO is used to communicate to
the lower layers that the submitter considers this IO to be more
important that non REQ_PRIO IO and so dispatch should be expedited.
IOWs, if the filesystem considers metadata IO to be more important
that user data IO, then it will use REQ_PRIO | REQ_META rather than
just REQ_META.
Then it seems bios with REQ_META or REQ_PRIO should both be cached for
performance optimation, because they are all probably low I/O latency
demand by upper layer (e.g. file system).
So in this patch, when we want to decide whether to bypass the cache,
REQ_META and REQ_PRIO are both checked. Then both metadata and
high priority I/O requests will be handled properly.
Reported-by: Nix <nix@esperi.org.uk>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Andre Noll <maan@tuebingen.mpg.de>
Tested-by: Nix <nix@esperi.org.uk>
Cc: stable@vger.kernel.org
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Sat, 9 Feb 2019 04:53:10 +0000 (12:53 +0800)]
bcache: fix input overflow to cache set sysfs file io_error_halflife
Cache set sysfs entry io_error_halflife is used to set c->error_decay.
c->error_decay is in type unsigned int, and it is converted by
strtoul_or_return(), therefore overflow to c->error_decay is possible
for a large input value.
This patch fixes the overflow by using strtoul_safe_clamp() to convert
input string to an unsigned long value in range [0, UINT_MAX], then
divides by 88 and set it to c->error_decay.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Sat, 9 Feb 2019 04:53:09 +0000 (12:53 +0800)]
bcache: fix input overflow to cache set io_error_limit
c->error_limit is in type unsigned int, it is set via cache set sysfs
file io_error_limit. Inside the bcache code, input string is converted
by strtoul_or_return() and set the converted value to c->error_limit.
Because the converted value is unsigned long, and c->error_limit is
unsigned int, if the input is large enought, overflow will happen to
c->error_limit.
This patch uses sysfs_strtoul_clamp() to convert input string, and set
the range in [0, UINT_MAX] to avoid the potential overflow.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Sat, 9 Feb 2019 04:53:08 +0000 (12:53 +0800)]
bcache: fix input overflow to journal_delay_ms
c->journal_delay_ms is in type unsigned short, it is set via sysfs
interface and converted by sysfs_strtoul() from input string to
unsigned short value. Therefore overflow to unsigned short might be
happen when the converted value exceed USHRT_MAX. e.g. writing
65536 into sysfs file journal_delay_ms, c->journal_delay_ms is set to
0.
This patch uses sysfs_strtoul_clamp() to convert the input string and
limit value range in [0, USHRT_MAX], to avoid the input overflow.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Sat, 9 Feb 2019 04:53:07 +0000 (12:53 +0800)]
bcache: fix input overflow to writeback_rate_minimum
dc->writeback_rate_minimum is type unsigned integer variable, it is set
via sysfs interface, and converte from input string to unsigned integer
by d_strtoul_nonzero(). When the converted input value is larger than
UINT_MAX, overflow to unsigned integer happens.
This patch fixes the overflow by using sysfs_strotoul_clamp() to
convert input string and limit the value in range [1, UINT_MAX], then
the overflow can be avoided.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Sat, 9 Feb 2019 04:53:06 +0000 (12:53 +0800)]
bcache: fix potential div-zero error of writeback_rate_p_term_inverse
Current code already uses d_strtoul_nonzero() to convert input string
to an unsigned integer, to make sure writeback_rate_p_term_inverse
won't be zero value. But overflow may happen when converting input
string to an unsigned integer value by d_strtoul_nonzero(), then
dc->writeback_rate_p_term_inverse can still be set to 0 even if the
sysfs file input value is not zero, e.g.
4294967296 (a.k.a UINT_MAX+1).
If dc->writeback_rate_p_term_inverse is set to 0, it might cause a
dev-zero error in following code from __update_writeback_rate(),
int64_t proportional_scaled =
div_s64(error, dc->writeback_rate_p_term_inverse);
This patch replaces d_strtoul_nonzero() by sysfs_strtoul_clamp() and
limit the value range in [1, UINT_MAX]. Then the unsigned integer
overflow and dev-zero error can be avoided.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Sat, 9 Feb 2019 04:53:05 +0000 (12:53 +0800)]
bcache: fix potential div-zero error of writeback_rate_i_term_inverse
dc->writeback_rate_i_term_inverse can be set via sysfs interface. It is
in type unsigned int, and convert from input string by d_strtoul(). The
problem is d_strtoul() does not check valid range of the input, if
4294967296 is written into sysfs file writeback_rate_i_term_inverse,
an overflow of unsigned integer will happen and value 0 is set to
dc->writeback_rate_i_term_inverse.
In writeback.c:__update_writeback_rate(), there are following lines of
code,
integral_scaled = div_s64(dc->writeback_rate_integral,
dc->writeback_rate_i_term_inverse);
If dc->writeback_rate_i_term_inverse is set to 0 via sysfs interface,
a div-zero error might be triggered in the above code.
Therefore we need to add a range limitation in the sysfs interface,
this is what this patch does, use sysfs_stroul_clamp() to replace
d_strtoul() and restrict the input range in [1, UINT_MAX].
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Sat, 9 Feb 2019 04:53:04 +0000 (12:53 +0800)]
bcache: fix input overflow to writeback_delay
Sysfs file writeback_delay is used to configure dc->writeback_delay
which is type unsigned int. But bcache code uses sysfs_strtoul() to
convert the input string, therefore it might be overflowed if the input
value is too large. E.g. input value is
4294967296 but indeed 0 is
set to dc->writeback_delay.
This patch uses sysfs_strtoul_clamp() to convert the input string and
set the result value range in [0, UINT_MAX] to avoid such unsigned
integer overflow.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Sat, 9 Feb 2019 04:53:03 +0000 (12:53 +0800)]
bcache: use sysfs_strtoul_bool() to set bit-field variables
When setting bcache parameters via sysfs, there are some variables are
defined as bit-field value. Current bcache code in sysfs.c uses either
d_strtoul() or sysfs_strtoul() to convert the input string to unsigned
integer value and set it to the corresponded bit-field value.
The problem is, the bit-field value only takes the lowest bit of the
converted value. If input is 2, the expected value (like bool value)
of the bit-field value should be 1, but indeed it is 0.
The following sysfs files for bit-field variables have such problem,
bypass_torture_test, for dc->bypass_torture_test
writeback_metadata, for dc->writeback_metadata
writeback_running, for dc->writeback_running
verify, for c->verify
key_merging_disabled, for c->key_merging_disabled
gc_always_rewrite, for c->gc_always_rewrite
btree_shrinker_disabled,for c->shrinker_disabled
copy_gc_enabled, for c->copy_gc_enabled
This patch uses sysfs_strtoul_bool() to set such bit-field variables,
then if the converted value is non-zero, the bit-field variables will
be set to 1, like setting a bool value like expensive_debug_checks.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Sat, 9 Feb 2019 04:53:02 +0000 (12:53 +0800)]
bcache: add sysfs_strtoul_bool() for setting bit-field variables
When setting bool values via sysfs interface, e.g. writeback_metadata,
if writing 1 into writeback_metadata file, dc->writeback_metadata is
set to 1, but if writing 2 into the file, dc->writeback_metadata is
0. This is misleading, a better result should be 1 for all non-zero
input value.
It is because dc->writeback_metadata is a bit-field variable, and
current code simply use d_strtoul() to convert a string into integer
and takes the lowest bit value. To fix such error, we need a routine
to convert the input string into unsigned integer, and set target
variable to 1 if the converted integer is non-zero.
This patch introduces a new macro called sysfs_strtoul_bool(), it can
be used to convert input string into bool value, we can use it to set
bool value for bit-field vairables.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Sat, 9 Feb 2019 04:53:01 +0000 (12:53 +0800)]
bcache: fix input overflow to sequential_cutoff
People may set sequential_cutoff of a cached device via sysfs file,
but current code does not check input value overflow. E.g. if value
4294967295 (UINT_MAX) is written to file sequential_cutoff, its value
is 4GB, but if
4294967296 (UINT_MAX + 1) is written into, its value
will be 0. This is an unexpected behavior.
This patch replaces d_strtoi_h() by sysfs_strtoul_clamp() to convert
input string to unsigned integer value, and limit its range in
[0, UINT_MAX]. Then the input overflow can be fixed.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Sat, 9 Feb 2019 04:53:00 +0000 (12:53 +0800)]
bcache: fix input integer overflow of congested threshold
Cache set congested threshold values congested_read_threshold_us and
congested_write_threshold_us can be set via sysfs interface. These
two values are 'unsigned int' type, but sysfs interface uses strtoul
to convert input string. So if people input a large number like
9999999999, the value indeed set is
1410065407, which is not expected
behavior.
This patch replaces sysfs_strtoul() by sysfs_strtoul_clamp() when
convert input string to unsigned int value, and set value range in
[0, UINT_MAX], to avoid the above integer overflow errors.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Sat, 9 Feb 2019 04:52:59 +0000 (12:52 +0800)]
bcache: improve sysfs_strtoul_clamp()
Currently sysfs_strtoul_clamp() is defined as,
82 #define sysfs_strtoul_clamp(file, var, min, max) \
83 do { \
84 if (attr == &sysfs_ ## file) \
85 return strtoul_safe_clamp(buf, var, min, max) \
86 ?: (ssize_t) size; \
87 } while (0)
The problem is, if bit width of var is less then unsigned long, min and
max may not protect var from integer overflow, because overflow happens
in strtoul_safe_clamp() before checking min and max.
To fix such overflow in sysfs_strtoul_clamp(), to make min and max take
effect, this patch adds an unsigned long variable, and uses it to macro
strtoul_safe_clamp() to convert an unsigned long value in range defined
by [min, max]. Then assign this value to var. By this method, if bit
width of var is less than unsigned long, integer overflow won't happen
before min and max are checking.
Now sysfs_strtoul_clamp() can properly handle smaller data type like
unsigned int, of cause min and max should be defined in range of
unsigned int too.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tang Junhui [Sat, 9 Feb 2019 04:52:58 +0000 (12:52 +0800)]
bcache: treat stale && dirty keys as bad keys
Stale && dirty keys can be produced in the follow way:
After writeback in write_dirty_finish(), dirty keys k1 will
replace by clean keys k2
==>ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key);
==>btree_insert_fn(struct btree_op *b_op, struct btree *b)
==>static int bch_btree_insert_node(struct btree *b,
struct btree_op *op,
struct keylist *insert_keys,
atomic_t *journal_ref,
Then two steps:
A) update k1 to k2 in btree node memory;
bch_btree_insert_keys(b, op, insert_keys, replace_key)
B) Write the bset(contains k2) to cache disk by a 30s delay work
bch_btree_leaf_dirty(b, journal_ref).
But before the 30s delay work write the bset to cache device,
these things happened:
A) GC works, and reclaim the bucket k2 point to;
B) Allocator works, and invalidate the bucket k2 point to,
and increase the gen of the bucket, and place it into free_inc
fifo;
C) Until now, the 30s delay work still does not finish work,
so in the disk, the key still is k1, it is dirty and stale
(its gen is smaller than the gen of the bucket). and then the
machine power off suddenly happens;
D) When the machine power on again, after the btree reconstruction,
the stale dirty key appear.
In bch_extent_bad(), when expensive_debug_checks is off, it would
treat the dirty key as good even it is stale keys, and it would
cause bellow probelms:
A) In read_dirty() it would cause machine crash:
BUG_ON(ptr_stale(dc->disk.c, &w->key, 0));
B) It could be worse when reads hits stale dirty keys, it would
read old incorrect data.
This patch tolerate the existence of these stale && dirty keys,
and treat them as bad key in bch_extent_bad().
(Coly Li: fix indent which was modified by sender's email client)
Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Colin Ian King [Sat, 9 Feb 2019 04:52:57 +0000 (12:52 +0800)]
bcache: fix indentation issue, remove tabs on a hunk of code
There is a hunk of code that is indented one level too deep, fix this
by removing the extra tabs.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Sat, 9 Feb 2019 04:52:56 +0000 (12:52 +0800)]
bcache: export backing_dev_uuid via sysfs
When there are multiple bcache devices, after a reboot the name of
bcache devices may change (e.g. current /dev/bcache1 was /dev/bcache0
before reboot). Therefore we need the backing device UUID (sb.uuid) to
identify each bcache device.
Backing device uuid can be found by program bcache-super-show, but
directly exporting backing_dev_uuid by sysfs file
/sys/block/bcache<?>/bcache/backing_dev_uuid is a much simpler method.
With backing_dev_uuid, and partition uuids from /dev/disk/by-partuuid/,
now we can identify each bcache device and its partitions conveniently.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Sat, 9 Feb 2019 04:52:55 +0000 (12:52 +0800)]
bcache: export backing_dev_name via sysfs
This patch export dc->backing_dev_name to sysfs file
/sys/block/bcache<?>/bcache/backing_dev_name, then people or user space
tools may know the backing device name of this bcache device.
Of cause it can be done by parsing sysfs links, but this method can be
much simpler to find the link between bcache device and backing device.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Sat, 9 Feb 2019 04:52:54 +0000 (12:52 +0800)]
bcache: not use hard coded memset size in bch_cache_accounting_clear()
In stats.c:bch_cache_accounting_clear(), a hard coded number '7' is
used in memset(). It is because in struct cache_stats, there are 7
atomic_t type members. This is not good when new members added into
struct stats, the hard coded number will only clear part of memory.
This patch replaces 'sizeof(unsigned long) * 7' by more generic
'sizeof(struct cache_stats))', to avoid potential error if new
member added into struct cache_stats.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Daniel Axtens [Sat, 9 Feb 2019 04:52:53 +0000 (12:52 +0800)]
bcache: never writeback a discard operation
Some users see panics like the following when performing fstrim on a
bcached volume:
[ 529.803060] BUG: unable to handle kernel NULL pointer dereference at
0000000000000008
[ 530.183928] #PF error: [normal kernel read fault]
[ 530.412392] PGD
8000001f42163067 P4D
8000001f42163067 PUD
1f42168067 PMD 0
[ 530.750887] Oops: 0000 [#1] SMP PTI
[ 530.920869] CPU: 10 PID: 4167 Comm: fstrim Kdump: loaded Not tainted 5.0.0-rc1+ #3
[ 531.290204] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
[ 531.693137] RIP: 0010:blk_queue_split+0x148/0x620
[ 531.922205] Code: 60 38 89 55 a0 45 31 db 45 31 f6 45 31 c9 31 ff 89 4d 98 85 db 0f 84 7f 04 00 00 44 8b 6d 98 4c 89 ee 48 c1 e6 04 49 03 70 78 <8b> 46 08 44 8b 56 0c 48
8b 16 44 29 e0 39 d8 48 89 55 a8 0f 47 c3
[ 532.838634] RSP: 0018:
ffffb9b708df39b0 EFLAGS:
00010246
[ 533.093571] RAX:
00000000ffffffff RBX:
0000000000046000 RCX:
0000000000000000
[ 533.441865] RDX:
0000000000000200 RSI:
0000000000000000 RDI:
0000000000000000
[ 533.789922] RBP:
ffffb9b708df3a48 R08:
ffff940d3b3fdd20 R09:
0000000000000000
[ 534.137512] R10:
ffffb9b708df3958 R11:
0000000000000000 R12:
0000000000000000
[ 534.485329] R13:
0000000000000000 R14:
0000000000000000 R15:
ffff940d39212020
[ 534.833319] FS:
00007efec26e3840(0000) GS:
ffff940d1f480000(0000) knlGS:
0000000000000000
[ 535.224098] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 535.504318] CR2:
0000000000000008 CR3:
0000001f4e256004 CR4:
00000000001606e0
[ 535.851759] Call Trace:
[ 535.970308] ? mempool_alloc_slab+0x15/0x20
[ 536.174152] ? bch_data_insert+0x42/0xd0 [bcache]
[ 536.403399] blk_mq_make_request+0x97/0x4f0
[ 536.607036] generic_make_request+0x1e2/0x410
[ 536.819164] submit_bio+0x73/0x150
[ 536.980168] ? submit_bio+0x73/0x150
[ 537.149731] ? bio_associate_blkg_from_css+0x3b/0x60
[ 537.391595] ? _cond_resched+0x1a/0x50
[ 537.573774] submit_bio_wait+0x59/0x90
[ 537.756105] blkdev_issue_discard+0x80/0xd0
[ 537.959590] ext4_trim_fs+0x4a9/0x9e0
[ 538.137636] ? ext4_trim_fs+0x4a9/0x9e0
[ 538.324087] ext4_ioctl+0xea4/0x1530
[ 538.497712] ? _copy_to_user+0x2a/0x40
[ 538.679632] do_vfs_ioctl+0xa6/0x600
[ 538.853127] ? __do_sys_newfstat+0x44/0x70
[ 539.051951] ksys_ioctl+0x6d/0x80
[ 539.212785] __x64_sys_ioctl+0x1a/0x20
[ 539.394918] do_syscall_64+0x5a/0x110
[ 539.568674] entry_SYSCALL_64_after_hwframe+0x44/0xa9
We have observed it where both:
1) LVM/devmapper is involved (bcache backing device is LVM volume) and
2) writeback cache is involved (bcache cache_mode is writeback)
On one machine, we can reliably reproduce it with:
# echo writeback > /sys/block/bcache0/bcache/cache_mode
(not sure whether above line is required)
# mount /dev/bcache0 /test
# for i in {0..10}; do
file="$(mktemp /test/zero.XXX)"
dd if=/dev/zero of="$file" bs=1M count=256
sync
rm $file
done
# fstrim -v /test
Observing this with tracepoints on, we see the following writes:
fstrim-18019 [022] .... 91107.302026: bcache_write:
73f95583-561c-408f-a93a-
4cbd2498f5c8 inode 0 DS 4260112 + 196352 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302050: bcache_write:
73f95583-561c-408f-a93a-
4cbd2498f5c8 inode 0 DS 4456464 + 262144 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302075: bcache_write:
73f95583-561c-408f-a93a-
4cbd2498f5c8 inode 0 DS 4718608 + 81920 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302094: bcache_write:
73f95583-561c-408f-a93a-
4cbd2498f5c8 inode 0 DS 5324816 + 180224 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302121: bcache_write:
73f95583-561c-408f-a93a-
4cbd2498f5c8 inode 0 DS 5505040 + 262144 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302145: bcache_write:
73f95583-561c-408f-a93a-
4cbd2498f5c8 inode 0 DS 5767184 + 81920 hit 0 bypass 1
fstrim-18019 [022] .... 91107.308777: bcache_write:
73f95583-561c-408f-a93a-
4cbd2498f5c8 inode 0 DS 6373392 + 180224 hit 1 bypass 0
<crash>
Note the final one has different hit/bypass flags.
This is because in should_writeback(), we were hitting a case where
the partial stripe condition was returning true and so
should_writeback() was returning true early.
If that hadn't been the case, it would have hit the would_skip test, and
as would_skip == s->iop.bypass == true, should_writeback() would have
returned false.
Looking at the git history from 'commit
72c270612bd3 ("bcache: Write out
full stripes")', it looks like the idea was to optimise for raid5/6:
* If a stripe is already dirty, force writes to that stripe to
writeback mode - to help build up full stripes of dirty data
To fix this issue, make sure that should_writeback() on a discard op
never returns true.
More details of debugging:
https://www.spinics.net/lists/linux-bcache/msg06996.html
Previous reports:
- https://bugzilla.kernel.org/show_bug.cgi?id=201051
- https://bugzilla.kernel.org/show_bug.cgi?id=196103
- https://www.spinics.net/lists/linux-bcache/msg06885.html
(Coly Li: minor modification to follow maximum 75 chars per line rule)
Cc: Kent Overstreet <koverstreet@google.com>
Cc: stable@vger.kernel.org
Fixes:
72c270612bd3 ("bcache: Write out full stripes")
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ingo Molnar [Sat, 9 Feb 2019 12:13:45 +0000 (13:13 +0100)]
Merge tag 'perf-urgent-for-mingo-5.0-
20190205' of git://git./linux/kernel/git/acme/linux into perf/urgent
Pull perf/urgent fixes from Arnaldo Carvalho de Melo:
perf trace:
Arnaldo Carvalho de Melo:
Fix handling of probe:vfs_getname when the probed routine is
inlined in multiple places, fixing the collection of the 'filename'
parameter in open syscalls.
perf test:
Gustavo A. R. Silva:
Fix bitwise operator usage in evsel-tp-sched test, which made tat
test always detect fields as signed.
Jiri Olsa:
Filter out hidden symbols from labels, added in systems where the
annobin plugin is used, such as RHEL8, which, if left in place make
the DWARF unwind 'perf test' to fail on PPC.
Tony Jones:
Fix 'perf_event_attr' tests when building with python3.
perf mem/c2c:
Ravi Bangoria:
Fix perf_mem_events on PowerPC.
tools headers UAPI:
Arnaldo Carvalho de Melo:
Sync linux/in.h copy from the kernel sources, silencing a perf build warning.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Linus Torvalds [Sat, 9 Feb 2019 00:23:41 +0000 (16:23 -0800)]
Merge tag 'armsoc-fixes-5.0' of git://git./linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"This is a bit larger than normal, as we had not managed to send out a
pull request before traveling for a week without my signing key.
There are multiple code fixes for older bugs, all of which should get
backported into stable kernels:
- tango: one fix for multiplatform configurations broken on other
platforms when tango is enabled
- arm_scmi: device unregistration fix
- iop32x: fix kernel oops from extraneous __init annotation
- pxa: remove a double kfree
- fsl qbman: close an interrupt clearing race
The rest is the usual collection of smaller fixes for device tree
files, on the renesas, allwinner, meson, omap, davinci, qualcomm and
imx platforms.
Some of these are for compile-time warnings, most are for board
specific functionality that fails to work because of incorrect
settings"
* tag 'armsoc-fixes-5.0' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (30 commits)
ARM: tango: Improve ARCH_MULTIPLATFORM compatibility
firmware: arm_scmi: provide the mandatory device release callback
ARM: iop32x/n2100: fix PCI IRQ mapping
arm64: dts: add msm8996 compatible to gicv3
ARM: dts: am335x-shc.dts: fix wrong cd pin level
ARM: dts: n900: fix mmc1 card detect gpio polarity
ARM: dts: omap3-gta04: Fix graph_port warning
ARM: pxa: ssp: unneeded to free devm_ allocated data
ARM: dts: r8a7743: Convert to new LVDS DT bindings
soc: fsl: qbman: avoid race in clearing QMan interrupt
arm64: dts: renesas: r8a77965: Enable DMA for SCIF2
arm64: dts: renesas: r8a7796: Enable DMA for SCIF2
arm64: dts: renesas: r8a774a1: Enable DMA for SCIF2
ARM: dts: da850: fix interrupt numbers for clocksource
dt-bindings: imx8mq: Number clocks consecutively
arm64: dts: meson: Fix mmc cd-gpios polarity
ARM: dts: imx6sx: correct backward compatible of gpt
ARM: dts: imx: replace gpio-key,wakeup with wakeup-source property
ARM: dts: vf610-bk4: fix incorrect #address-cells for dspi3
ARM: dts: meson8m2: mxiii-plus: mark the SD card detection GPIO active-low
...
Linus Torvalds [Sat, 9 Feb 2019 00:21:33 +0000 (16:21 -0800)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"Two arm64 fixes for -rc6. They resolve a kernel NULL dereference in
kexec and bogus kernel page table dumping when userspace is configured
for 52-bit virtual addressing.
Summary:
- Fix kernel oops when attemping kexec_file() with a NULL cmdline
- Fix page table output in debugfs when ARM64_USER_VA_BITS_52=y"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: kexec_file: handle empty command-line
arm64: ptdump: Don't iterate kernel page tables using PTRS_PER_PXX
Linus Torvalds [Sat, 9 Feb 2019 00:04:12 +0000 (16:04 -0800)]
Merge tag 'powerpc-5.0-4' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Just two fixes, both going to stable.
- Our support for split pmd page table lock had a bug which could
lead to a crash on mremap() when using the Radix MMU (Power9 only).
- A fix for the PAPR SCM driver (nvdimm) we added last release, which
had a bug where we might mis-handle a hypervisor response leading
to us failing to attach the memory region.
Thanks to: Aneesh Kumar K.V, Oliver O'Halloran"
* tag 'powerpc-5.0-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/papr_scm: Use the correct bind address
powerpc/radix: Fix kernel crash with mremap()
Linus Torvalds [Fri, 8 Feb 2019 23:39:28 +0000 (15:39 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/ebiederm/user-namespace
Pull signal fixes from Eric Biederman:
"This contains four small fixes for signal handling. A missing range
check, a regression fix, prioritizing signals we have already started
a signal group exit for, and better detection of synchronous signals.
The confused decision of which signals to handle failed spectacularly
when a timer was pointed at SIGBUS and the stack overflowed. Resulting
in an unkillable process in an infinite loop instead of a SIGSEGV and
core dump"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
signal: Better detection of synchronous signals
signal: Always notice exiting tasks
signal: Always attempt to allocate siginfo for SIGSTOP
signal: Make siginmask safe when passed a signal of 0
Linus Torvalds [Fri, 8 Feb 2019 23:37:17 +0000 (15:37 -0800)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"This is a set of five minor fixes (although, tecnhincally, the aicxxx
fix is for a major problem in that the driver won't load without it,
but I think the fact it's taken us since 4.10 to discover this
indicates that the user base for these things has declined)"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: cxlflash: Prevent deadlock when adapter probe fails
Revert "scsi: libfc: Add WARN_ON() when deleting rports"
scsi: sd_zbc: Fix zone information messages
scsi: target: make the pi_prot_format ConfigFS path readable
scsi: aic94xx: fix module loading
Linus Torvalds [Fri, 8 Feb 2019 23:34:10 +0000 (15:34 -0800)]
Merge tag 'iommu-fixes-v5.0-rc5' of git://git./linux/kernel/git/joro/iommu
Pull IOMMU fix from Joerg Roedel:
"Intel decided to leave the newly added Scalable Mode Feature
default-disabled for now. The patch here accomplishes that"
* tag 'iommu-fixes-v5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/vt-d: Leave scalable mode default off
Linus Torvalds [Fri, 8 Feb 2019 23:32:10 +0000 (15:32 -0800)]
Merge tag 'pci-v5.0-fixes-4' of git://git./linux/kernel/git/helgaas/pci
Pull PCI fix from Bjorn Helgaas:
"Work around Synopsys duplicate Device ID (HAPS USB3, NXP i.MX) that
breaks PCIe on I.MX SoCs (Thinh Nguyen)"
* tag 'pci-v5.0-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: Work around Synopsys duplicate Device ID (HAPS USB3, NXP i.MX)
Linus Torvalds [Fri, 8 Feb 2019 23:30:02 +0000 (15:30 -0800)]
Merge tag 'acpi-5.0-rc6' of git://git./linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
"This prevents excessive ACPI debug messages from being printed to the
kernel log, which has started to happen after one of the recent ACPICA
commits (Erik Schmauss)"
* tag 'acpi-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: Set debug output flags independent of ACPICA
Andrew Lunn [Fri, 8 Feb 2019 16:56:31 +0000 (17:56 +0100)]
MAINTAINERS: Update the ocores i2c bus driver maintainer, etc
The listed maintainer has not been responding to emails for a while.
Add myself as a second maintainer.
Add the platform data include file, which was not listed.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Aleksei Zakharov [Fri, 8 Feb 2019 16:14:05 +0000 (19:14 +0300)]
block: avoid setting nr_requests to current value
There's no reason to freeze queue and set nr_requests value
if current value is the same.
Signed-off-by: Aleksei Zakharov <zakharov.a.g@yandex.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Liu Bo [Fri, 25 Jan 2019 00:12:49 +0000 (08:12 +0800)]
blk-mq: remove duplicated definition of blk_mq_freeze_queue
As the prototype has been defined in "include/linux/blk-mq.h", the one
in "block/blk-mq.h" can be removed then.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Liu Bo [Fri, 25 Jan 2019 00:12:48 +0000 (08:12 +0800)]
Blk-iolatency: warn on negative inflight IO counter
This is to catch any unexpected negative value of inflight IO counter.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Liu Bo [Fri, 25 Jan 2019 00:12:47 +0000 (08:12 +0800)]
blk-iolatency: fix IO hang due to negative inflight counter
Our test reported the following stack, and vmcore showed that
->inflight counter is -1.
[
ffffc9003fcc38d0] __schedule at
ffffffff8173d95d
[
ffffc9003fcc3958] schedule at
ffffffff8173de26
[
ffffc9003fcc3970] io_schedule at
ffffffff810bb6b6
[
ffffc9003fcc3988] blkcg_iolatency_throttle at
ffffffff813911cb
[
ffffc9003fcc3a20] rq_qos_throttle at
ffffffff813847f3
[
ffffc9003fcc3a48] blk_mq_make_request at
ffffffff8137468a
[
ffffc9003fcc3b08] generic_make_request at
ffffffff81368b49
[
ffffc9003fcc3b68] submit_bio at
ffffffff81368d7d
[
ffffc9003fcc3bb8] ext4_io_submit at
ffffffffa031be00 [ext4]
[
ffffc9003fcc3c00] ext4_writepages at
ffffffffa03163de [ext4]
[
ffffc9003fcc3d68] do_writepages at
ffffffff811c49ae
[
ffffc9003fcc3d78] __filemap_fdatawrite_range at
ffffffff811b6188
[
ffffc9003fcc3e30] filemap_write_and_wait_range at
ffffffff811b6301
[
ffffc9003fcc3e60] ext4_sync_file at
ffffffffa030cee8 [ext4]
[
ffffc9003fcc3ea8] vfs_fsync_range at
ffffffff8128594b
[
ffffc9003fcc3ee8] do_fsync at
ffffffff81285abd
[
ffffc9003fcc3f18] sys_fsync at
ffffffff81285d50
[
ffffc9003fcc3f28] do_syscall_64 at
ffffffff81003c04
[
ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at
ffffffff81742b8e
The ->inflight counter may be negative (-1) if
1) blk-iolatency was disabled when the IO was issued,
2) blk-iolatency was enabled before this IO reached its endio,
3) the ->inflight counter is decreased from 0 to -1 in endio()
In fact the hang can be easily reproduced by the below script,
H=/sys/fs/cgroup/unified/
P=/sys/fs/cgroup/unified/test
echo "+io" > $H/cgroup.subtree_control
mkdir -p $P
echo $$ > $P/cgroup.procs
xfs_io -f -d -c "pwrite 0 4k" /dev/sdg
echo "`cat /sys/block/sdg/dev` target=1000000" > $P/io.latency
xfs_io -f -d -c "pwrite 0 4k" /dev/sdg
This fixes the problem by freezing the queue so that while
enabling/disabling iolatency, there is no inflight rq running.
Note that quiesce_queue is not needed as this only updating iolatency
configuration about which dispatching request_queue doesn't care.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Fri, 8 Feb 2019 19:21:54 +0000 (11:21 -0800)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
"This pull request is dedicated to the upcoming snowpocalypse parts 2
and 3 in the Pacific Northwest:
1) Drop profiles are broken because some drivers use dev_kfree_skb*
instead of dev_consume_skb*, from Yang Wei.
2) Fix IWLWIFI kconfig deps, from Luca Coelho.
3) Fix percpu maps updating in bpftool, from Paolo Abeni.
4) Missing station release in batman-adv, from Felix Fietkau.
5) Fix some networking compat ioctl bugs, from Johannes Berg.
6) ucc_geth must reset the BQL queue state when stopping the device,
from Mathias Thore.
7) Several XDP bug fixes in virtio_net from Toshiaki Makita.
8) TSO packets must be sent always on queue 0 in stmmac, from Jose
Abreu.
9) Fix socket refcounting bug in RDS, from Eric Dumazet.
10) Handle sparse cpu allocations in bpf selftests, from Martynas
Pumputis.
11) Make sure mgmt frames have enough tailroom in mac80211, from Felix
Feitkau.
12) Use safe list walking in sctp_sendmsg() asoc list traversal, from
Greg Kroah-Hartman.
13) Make DCCP's ccid_hc_[rt]x_parse_options always check for NULL
ccid, from Eric Dumazet.
14) Need to reload WoL password into bcmsysport device after deep
sleeps, from Florian Fainelli.
15) Remove filter from mask before freeing in cls_flower, from Petr
Machata.
16) Missing release and use after free in error paths of s390 qeth
code, from Julian Wiedmann.
17) Fix lockdep false positive in dsa code, from Marc Zyngier.
18) Fix counting of ATU violations in mv88e6xxx, from Andrew Lunn.
19) Fix EQ firmware assert in qed driver, from Manish Chopra.
20) Don't default Caivum PTP to Y in kconfig, from Bjorn Helgaas"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
net: dsa: b53: Fix for failure when irq is not defined in dt
sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
geneve: should not call rt6_lookup() when ipv6 was disabled
net: Don't default Cavium PTP driver to 'y'
net: broadcom: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
net: via-velocity: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
net: tehuti: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
net: sun: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
net: fsl_ucc_hdlc: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
net: fec_mpc52xx: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
net: smsc: epic100: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
net: dscc4: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
net: tulip: de2104x: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
net: defxx: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
net/mlx5e: Don't overwrite pedit action when multiple pedit used
net/mlx5e: Update hw flows when encap source mac changed
qed*: Advance drivers version to 8.37.0.20
qed: Change verbosity for coalescing message.
qede: Fix system crash on configuring channels.
qed: Consider TX tcs while deriving the max num_queues for PF.
...
Linus Torvalds [Fri, 8 Feb 2019 18:56:31 +0000 (10:56 -0800)]
Merge tag 'char-misc-5.0-rc6' of git://git./linux/kernel/git/gregkh/char-misc
Pull char/misc fixes from Greg KH:
"Here are some small char and misc driver fixes for 5.0-rc6.
Nothing huge here, some more binderfs fixups found as people use it,
and there is a "large" selftest added to validate the binderfs code,
which makes up the majority of this pull request.
There's also some small mei and mic fixes to resolve some reported
issues.
All of these have been in linux-next for over a week with no reported
issues"
* tag 'char-misc-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
mic: vop: Fix crash on remove
mic: vop: Fix use-after-free on remove
binderfs: remove separate device_initcall()
fpga: stratix10-soc: fix wrong of_node_put() in init function
mic: vop: Fix broken virtqueues
mei: free read cb on ctrl_wr list flush
samples: mei: use /dev/mei0 instead of /dev/mei
mei: me: add ice lake point device id.
binderfs: respect limit on binder control creation
binder: fix CONFIG_ANDROID_BINDER_DEVICES
selftests: add binderfs selftests
Linus Torvalds [Fri, 8 Feb 2019 18:53:44 +0000 (10:53 -0800)]
Merge tag 'driver-core-5.0-rc6' of git://git./linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are some driver core fixes for 5.0-rc6.
Well, not so much "driver core" as "debugfs". There's a lot of
outstanding debugfs cleanup patches coming in through different
subsystem trees, and in that process the debugfs core was found that
it really should return errors when something bad happens, to prevent
random files from showing up in the root of debugfs afterward. So
debugfs was fixed up to handle this properly, and then two fixes for
the relay and blk-mq code was needed as it was making invalid
assumptions about debugfs return values.
There's also a cacheinfo fix in here that resolves a tiny issue.
All of these have been in linux-next for over a week with no reported
problems"
* tag 'driver-core-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
blk-mq: protect debugfs_create_files() from failures
relay: check return of create_buf_file() properly
debugfs: debugfs_lookup() should return NULL if not found
debugfs: return error values, not NULL
debugfs: fix debugfs_rename parameter checking
cacheinfo: Keep the old value if of_property_read_u32 fails
Linus Torvalds [Fri, 8 Feb 2019 18:51:59 +0000 (10:51 -0800)]
Merge tag 'staging-5.0-rc6' of git://git./linux/kernel/git/gregkh/staging
Pull staging/IIO driver fixes from Greg KH:
"Here are some small iio and staging driver fixes for 5.0-rc6.
Nothing big, just resolve some reported IIO driver issues, and one
staging driver bug. One staging driver patch was added and then
reverted as well.
All of these have been in linux-next for a while with no reported
issues"
* tag 'staging-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
Revert "staging: erofs: keep corrupted fs from crashing kernel in erofs_namei()"
staging: erofs: keep corrupted fs from crashing kernel in erofs_namei()
staging: octeon: fix broken phylib usage
iio: ti-ads8688: Update buffer allocation for timestamps
tools: iio: iio_generic_buffer: make num_loops signed
iio: adc: axp288: Fix TS-pin handling
iio: chemical: atlas-ph-sensor: correct IIO_TEMP values to millicelsius
Linus Torvalds [Fri, 8 Feb 2019 18:49:55 +0000 (10:49 -0800)]
Merge tag 'tty-5.0-rc6' of git://git./linux/kernel/git/gregkh/tty
Pull tty/serial fixes from Greg KH:
"Here are some small tty and serial fixes for 5.0-rc6.
Nothing huge, just a few small fixes for reported issues. The speakup
fix is in here as it is a tty operation issue.
All of these have been in linux-next for a while with no reported
problems"
* tag 'tty-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
serial: fix race between flush_to_ldisc and tty_open
staging: speakup: fix tty-operation NULL derefs
serial: sh-sci: Do not free irqs that have already been freed
serial: 8250_pci: Make PCI class test non fatal
tty: serial: 8250_mtk: Fix potential NULL pointer dereference
Linus Torvalds [Fri, 8 Feb 2019 18:48:26 +0000 (10:48 -0800)]
Merge tag 'usb-5.0-rc6' of git://git./linux/kernel/git/gregkh/usb
Pull USB fixes from Grek KH:
"Here are some small USB fixes for 5.0-rc6.
Nothing huge, the normal amount of USB gadget fixes as well as some
USB phy fixes. There's also a typec fix as well. Full details are in
the shortlog.
All of these have been in linux-next for a while with no reported
issues"
* tag 'usb-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: typec: tcpm: Correct the PPS out_volt calculation
usb: gadget: musb: fix short isoc packets with inventra dma
usb: phy: am335x: fix race condition in _probe
usb: dwc3: exynos: Fix error handling of clk_prepare_enable
usb: phy: fix link errors
usb: gadget: udc: net2272: Fix bitwise and boolean operations
usb: dwc3: gadget: Handle 0 xfer length for OUT EP
Linus Torvalds [Fri, 8 Feb 2019 18:46:14 +0000 (10:46 -0800)]
Merge tag 'xfs-5.0-fixes-1' of git://git./fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Here are a handful of XFS fixes to fix a data corruption problem, a
crasher bug, and a deadlock.
Summary:
- Fix cache coherency problem with writeback mappings
- Fix buffer deadlock when shutting fs down
- Fix a null pointer dereference when running online repair"
* tag 'xfs-5.0-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: set buffer ops when repair probes for btree type
xfs: end sync buffer I/O properly on shutdown error
xfs: eof trim writeback mapping as soon as it is cached
Linus Torvalds [Fri, 8 Feb 2019 18:43:15 +0000 (10:43 -0800)]
Merge tag 'drm-fixes-2019-02-08' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Missed fixes last week as had nothing until amdgpu showed up on
Saturday. Other stuff has since rolled in along with some more amdgpu
fixes, so we have two weeks of those, and some i915, vmwgfx, sun4i,
rockchip and omap fixes.
amdgpu/radeon:
- fix crash on passthrough for SI
- fencing fix for shared buffers
- APU hwmon fix
- API powerplay fix
- eDP freesync fix
- PASID mgr locking fix
- KFD warning fix
- DC/powerplay fix
- raven revision ids fix
- vega20 doorbell fix
i915:
- SNB display fix
- SKL srckey mask fix
- ICL DDI clock selection fix
vmwgfx:
- DMA API fix
- IOMMU detection fix
- display fixes
sun4i:
- tcon clock fix
rockchip:
- SPDX identifier fix
omap:
- DSI fixes"
* tag 'drm-fixes-2019-02-08' of git://anongit.freedesktop.org/drm/drm: (28 commits)
drm/omap: dsi: Hack-fix DSI bus flags
drm/omap: dsi: Fix OF platform depopulate
drm/omap: dsi: Fix crash in DSI debug dumps
drm/i915: Try to sanitize bogus DPLL state left over by broken SNB BIOSen
drm/amd/display: Attach VRR properties for eDP connectors
drm/amdkfd: Fix if preprocessor statement above kfd_fill_iolink_info_for_cpu
drm/amdgpu: use spin_lock_irqsave to protect vm_manager.pasid_idr
drm/i915: always return something on DDI clock selection
drm/i915: Fix skl srckey mask bits
drm/vmwgfx: Improve on IOMMU detection
drm/vmwgfx: Fix setting of dma masks
drm/vmwgfx: Also check for crtc status while checking for DU active
drm/vmwgfx: Fix an uninitialized fence handle value
drm/vmwgfx: Return error code from vmw_execbuf_copy_fence_user
drm/sun4i: tcon: Prepare and enable TCON channel 0 clock at init
drm/amdgpu: fix the incorrect external id for raven series
drm/amdgpu: Implement doorbell self-ring for NBIO 7.4
drm/amd/display: Fix fclk idle state
drm/amdgpu: Transfer fences to dmabuf importer
drm/amd/powerplay: Fix missing break in switch
...
Lukas Bulwahn [Sat, 12 Jan 2019 09:07:23 +0000 (10:07 +0100)]
MAINTAINERS: unify reference to xen-devel list
In the linux kernel MAINTAINERS file, largely
"xen-devel@lists.xenproject.org (moderated for non-subscribers)"
is used to refer to the xen-devel mailing list.
The DRM DRIVERS FOR XEN section entry mentions
xen-devel@lists.xen.org instead, but that is just the same
mailing list as the mailing list above.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Guenter Roeck [Thu, 7 Feb 2019 05:13:49 +0000 (21:13 -0800)]
cdrom: Fix race condition in cdrom_sysctl_register
The following traceback is sometimes seen when booting an image in qemu:
[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50
The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.
Use a static atomic to ensure that the function is executed exactly once.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Peter Zijlstra [Fri, 8 Feb 2019 12:08:59 +0000 (13:08 +0100)]
x86/mm/cpa: Fix set_mce_nospec()
The recent commit
fe0937b24ff5 ("x86/mm/cpa: Fold cpa_flush_range() and
cpa_flush_array() into a single cpa_flush() function") accidentally made
the call to make_addr_canonical_again() go away, which breaks
set_mce_nospec().
Re-instate the call to convert the address back into canonical form right
before invoking either CLFLUSH or INVLPG. Rename the function while at it
to be shorter (and less MAGA).
Fixes:
fe0937b24ff5 ("x86/mm/cpa: Fold cpa_flush_range() and cpa_flush_array() into a single cpa_flush() function")
Reported-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lkml.kernel.org/r/20190208120859.GH32511@hirez.programming.kicks-ass.net