Ping-Xiang Chen [Wed, 14 Sep 2022 07:42:37 +0000 (00:42 -0700)]
block: fix comment typo in submit_bio of block-core.c.
This patch fix a comment typo in block-core.c.
Signed-off-by: Ping-Xiang Chen <p.x.chen@uci.edu>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220914074237.31621-1-p.x.chen@uci.edu
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 20 Sep 2022 13:50:14 +0000 (07:50 -0600)]
Merge tag 'nvme-6.1-2022-09-20' of git://git.infradead.org/nvme into for-6.1/block
Pull NVMe updates from Christoph:
"nvme updates for Linux 6.1
- handle number of queue changes in the TCP and RDMA drivers
(Daniel Wagner)
- allow changing the number of queues in nvmet (Daniel Wagner)
- also consider host_iface when checking ip options (Daniel Wagner)
- don't map pages which can't come from HIGHMEM (Fabio M. De Francesco)
- avoid unnecessary flush bios in nvmet (Guixin Liu)
- shrink and better pack the nvme_iod structure (Keith Busch)
- add comment for unaligned "fake" nqn (Linjun Bao)
- print actual source IP address through sysfs "address" attr
(Martin Belanger)
- various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang)"
* tag 'nvme-6.1-2022-09-20' of git://git.infradead.org/nvme:
nvme-tcp: print actual source IP address through sysfs "address" attr
nvmet-tcp: don't map pages which can't come from HIGHMEM
nvme-pci: move iod dma_len fill gaps
nvme-pci: iod npages fits in s8
nvme-pci: iod's 'aborted' is a bool
nvme-pci: remove nvme_queue from nvme_iod
nvme: consider also host_iface when checking ip options
nvme-rdma: handle number of queue changes
nvme-tcp: handle number of queue changes
nvmet: expose max queues to configfs
nvmet: avoid unnecessary flush bio
nvmet-auth: remove redundant parameters req
nvmet-auth: clean up with done_kfree
nvme-auth: remove the redundant req->cqe->result.u16 assignment operation
nvme: move from strlcpy with unused retval to strscpy
nvme: add comment for unaligned "fake" nqn
Coly Li [Mon, 19 Sep 2022 16:16:47 +0000 (00:16 +0800)]
bcache: fix set_at_max_writeback_rate() for multiple attached devices
Inside set_at_max_writeback_rate() the calculation in following if()
check is wrong,
if (atomic_inc_return(&c->idle_counter) <
atomic_read(&c->attached_dev_nr) * 6)
Because each attached backing device has its own writeback thread
running and increasing c->idle_counter, the counter increates much
faster than expected. The correct calculation should be,
(counter / dev_nr) < dev_nr * 6
which equals to,
counter < dev_nr * dev_nr * 6
This patch fixes the above mistake with correct calculation, and helper
routine idle_counter_exceeded() is added to make code be more clear.
Reported-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Acked-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Link: https://lore.kernel.org/r/20220919161647.81238-6-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jilin Yuan [Mon, 19 Sep 2022 16:16:46 +0000 (00:16 +0800)]
bcache:: fix repeated words in comments
Delete the redundant word 'we'.
Signed-off-by: Jilin Yuan <yuanjilin@cdjrlc.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220919161647.81238-5-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jules Maselbas [Mon, 19 Sep 2022 16:16:45 +0000 (00:16 +0800)]
bcache: bset: Fix comment typos
Remove the redundant word `by`, correct the typo `creaated`.
CC: Kent Overstreet <kent.overstreet@gmail.com>
CC: linux-bcache@vger.kernel.org
Signed-off-by: Jules Maselbas <jmaselbas@kalray.eu>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220919161647.81238-4-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lin Feng [Mon, 19 Sep 2022 16:16:44 +0000 (00:16 +0800)]
bcache: remove unused bch_mark_cache_readahead function def in stats.h
This is a cleanup for commit
1616a4c2ab1a ("bcache: remove bcache device
self-defined readahead")', currently no user for
bch_mark_cache_readahead() since that commit.
Signed-off-by: Lin Feng <linf@wangsu.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220919161647.81238-3-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Li Lei [Mon, 19 Sep 2022 16:16:43 +0000 (00:16 +0800)]
bcache: remove unnecessary flush_workqueue
All pending works will be drained by destroy_workqueue(), no need to call
flush_workqueue() explicitly.
Signed-off-by: Li Lei <lilei@szsandstone.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220919161647.81238-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Martin Belanger [Wed, 7 Sep 2022 12:27:37 +0000 (08:27 -0400)]
nvme-tcp: print actual source IP address through sysfs "address" attr
TCP transport relies on the routing table to determine which source
address and interface to use when making a connection. Currently, there
is no way to tell from userspace where a connection was made. This
patch exposes the actual source address using a new field named
"src_addr=" in the "address" attribute.
This is needed to diagnose and identify connectivity issues. With the
source address we can infer the interface associated with each
connection.
This was tested with nvme-cli 2.0 to verify it does not have any
adverse effect. The new "src_addr=" field will simply be displayed in
the output of the "list-subsys" or "list -v" commands as shown here.
$ nvme list-subsys
nvme-subsys0 - NQN=nqn.2014-08.org.nvmexpress.discovery
\
+- nvme0 tcp traddr=192.168.56.1,trsvcid=8009,src_addr=192.168.56.101 live
Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Fabio M. De Francesco [Tue, 30 Aug 2022 22:05:33 +0000 (00:05 +0200)]
nvmet-tcp: don't map pages which can't come from HIGHMEM
kmap() is being deprecated in favor of kmap_local_page().[1]
There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.
The pages which will be mapped are allocated in nvmet_tcp_map_data(),
using the GFP_KERNEL flag. This assures that they cannot come from
HIGHMEM. This imply that a straight page_address() can replace the kmap()
of sg_page(sg) in nvmet_tcp_map_pdu_iovec(). As a side effect, we might
also delete the field "nr_mapped" from struct "nvmet_tcp_cmd" because,
after removing the kmap() calls, there would be no longer any need of it.
In addition, there is no reason to use a kvec for the command receive
data buffers iovec, use a bio_vec instead and let iov_iter handle the
buffer mapping and data copy.
Test with blktests on a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel with
HIGHMEM64GB enabled.
[1] "[PATCH] checkpatch: Add kmap and kmap_atomic to the deprecated
list" https://lore.kernel.org/all/
20220813220034.806698-1-ira.weiny@intel.com/
Cc: Chaitanya Kulkarni <chaitanyak@nvidia.com>
Cc: Keith Busch <kbusch@kernel.org>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
[sagi: added bio_vec plus minor naming changes]
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Keith Busch [Wed, 7 Sep 2022 06:42:07 +0000 (08:42 +0200)]
nvme-pci: move iod dma_len fill gaps
The 32-bit field, dma_len, packs better in the iod struct above the
dma_addr_t on 64-bit systems.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Keith Busch [Tue, 6 Sep 2022 16:07:37 +0000 (09:07 -0700)]
nvme-pci: iod npages fits in s8
The largest allowed transfer is 4MB, which can use at most 1025 PRPs.
Each PRP is 8 bytes, so the maximum number of 4k nvme pages needed for
the iod_list is 3, which fits in an 's8' type.
While modifying this field, change the name to "nr_allocations" to
better represent that this is referring to the number of units allocated
from a dma_pool.
Also introduce a BUILD_BUG_ON to ensure we never accidently increase the
largest transfer limit beyond 127 chained prp lists.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Keith Busch [Tue, 6 Sep 2022 16:07:36 +0000 (09:07 -0700)]
nvme-pci: iod's 'aborted' is a bool
It's only true or false, so make this a bool to reflect that and save
some space in nvme_iod.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Keith Busch [Tue, 6 Sep 2022 16:07:35 +0000 (09:07 -0700)]
nvme-pci: remove nvme_queue from nvme_iod
We can get the nvme_queue from the req just as easily, so remove the
duplicate path to the same structure to save some space.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Daniel Wagner [Fri, 29 Jul 2022 14:26:30 +0000 (16:26 +0200)]
nvme: consider also host_iface when checking ip options
It's perfectly fine to use the same traddr and trsvcid more than once
as long we use different host interface. This is used in setups where
the host has more than one interface but the target exposes only one
traddr/trsvcid combination.
Use the same acceptance rules for host_iface as we have for
host_traddr.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Daniel Wagner [Mon, 29 Aug 2022 09:28:41 +0000 (11:28 +0200)]
nvme-rdma: handle number of queue changes
On reconnect, the number of queues might have changed.
In the case where we have more queues available than previously we try
to access queues which are not initialized yet.
The other case where we have less queues than previously, the
connection attempt will fail because the target doesn't support the
old number of queues and we end up in a reconnect loop.
Thus, only start queues which are currently present in the tagset
limited by the number of available queues. Then we update the tagset
and we can start any new queue.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Daniel Wagner [Mon, 29 Aug 2022 09:28:40 +0000 (11:28 +0200)]
nvme-tcp: handle number of queue changes
On reconnect, the number of queues might have changed.
In the case where we have more queues available than previously we try
to access queues which are not initialized yet.
The other case where we have less queues than previously, the
connection attempt will fail because the target doesn't support the
old number of queues and we end up in a reconnect loop.
Thus, only start queues which are currently present in the tagset
limited by the number of available queues. Then we update the tagset
and we can start any new queue.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Daniel Wagner [Mon, 29 Aug 2022 09:28:39 +0000 (11:28 +0200)]
nvmet: expose max queues to configfs
Allow to set the max queues the target supports. This is useful for
testing the reconnect attempt of the host with changing numbers of
supported queues.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Guixin Liu [Fri, 29 Jul 2022 03:53:05 +0000 (11:53 +0800)]
nvmet: avoid unnecessary flush bio
For no volatile write cache block device backend, sending flush bio is
unnecessary, avoid to do that.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Genjian Zhang [Tue, 23 Aug 2022 02:14:41 +0000 (10:14 +0800)]
nvmet-auth: remove redundant parameters req
The parameter is not used in this function, so remove it.
Signed-off-by: Genjian Zhang <zhanggenjian@kylinos.cn>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Jackie Liu [Fri, 12 Aug 2022 03:12:30 +0000 (11:12 +0800)]
nvmet-auth: clean up with done_kfree
Jump directly to done_kfree to release d, which is consistent with the
code style behind.
Reported-by: Genjian Zhang <zhanggenjian@kylinos.cn>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Jackie Liu [Fri, 12 Aug 2022 03:12:31 +0000 (11:12 +0800)]
nvme-auth: remove the redundant req->cqe->result.u16 assignment operation
req->cqe->result.u16 has already been assigned in the previous line, no
need to do it again.
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Wolfram Sang [Thu, 18 Aug 2022 21:00:52 +0000 (23:00 +0200)]
nvme: move from strlcpy with unused retval to strscpy
Follow the advice of the below link and prefer 'strscpy' in this
subsystem. Conversion is 1:1 because the return value is not used.
Generated by a coccinelle script.
Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Linjun Bao [Thu, 11 Aug 2022 07:40:24 +0000 (15:40 +0800)]
nvme: add comment for unaligned "fake" nqn
Current "fake" nqn field is "nqn.2014.08.org.nvmexpress:", it is
not aligned with the canonical version for history reasons.
Signed-off-by: Linjun Bao <meljbao@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Gaosheng Cui [Sun, 11 Sep 2022 09:26:45 +0000 (17:26 +0800)]
block/drbd: remove unused w_start_resync declaration
w_start_resync has been removed since
commit
ac0acb9e39ac ("drbd: use drbd_device_post_work()
in more places"), so remove it.
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Sat, 27 Aug 2022 10:16:37 +0000 (18:16 +0800)]
blk-throttle: cleanup tg_update_disptime()
tg_update_disptime() only need to adjust postion for 'tg' in
'parent_sq', there is no need to call throtl_enqueue/dequeue_tg(),
since they will set/clear flag THROTL_TG_PENDING and increase/decrease
nr_pending, which is useless. By the way, clear the flag/decrease
nr_pending while there are still throttled bios is not good for debugging.
There are no functional changes.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220827101637.1775111-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Sat, 27 Aug 2022 10:16:36 +0000 (18:16 +0800)]
blk-throttle: calling throtl_dequeue/enqueue_tg in pairs
It's a little weird to call throtl_dequeue_tg() unconditionally in
throtl_select_dispatch(), since it will be called in tg_update_disptime()
again if some bio is still throttled. Thus call it later if there are no
throttled bio. There are no functional changes.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220827101637.1775111-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Sat, 27 Aug 2022 10:16:35 +0000 (18:16 +0800)]
blk-throttle: use 'READ/WRITE' instead of '0/1'
Make the code easier to read, like everywhere else.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220827101637.1775111-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Mon, 29 Aug 2022 02:22:40 +0000 (10:22 +0800)]
blk-throttle: fix io hung due to configuration updates
If new configuration is submitted while a bio is throttled, then new
waiting time is recalculated regardless that the bio might already wait
for some time:
tg_conf_updated
throtl_start_new_slice
tg_update_disptime
throtl_schedule_next_dispatch
Then io hung can be triggered by always submmiting new configuration
before the throttled bio is dispatched.
Fix the problem by respecting the time that throttled bio already waited.
In order to do that, add new fields to record how many bytes/io are
waited, and use it to calculate wait time for throttled bio under new
configuration.
Some simple test:
1)
cd /sys/fs/cgroup/blkio/
echo $$ > cgroup.procs
echo "8:0 2048" > blkio.throttle.write_bps_device
{
sleep 2
echo "8:0 1024" > blkio.throttle.write_bps_device
} &
dd if=/dev/zero of=/dev/sda bs=8k count=1 oflag=direct
2)
cd /sys/fs/cgroup/blkio/
echo $$ > cgroup.procs
echo "8:0 1024" > blkio.throttle.write_bps_device
{
sleep 4
echo "8:0 2048" > blkio.throttle.write_bps_device
} &
dd if=/dev/zero of=/dev/sda bs=8k count=1 oflag=direct
test results: io finish time
before this patch with this patch
1) 10s 6s
2) 8s 6s
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220829022240.3348319-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Mon, 29 Aug 2022 02:22:39 +0000 (10:22 +0800)]
blk-throttle: factor out code to calculate ios/bytes_allowed
No functional changes, new apis will be used in later patches to
calculate wait time for throttled bios when new configuration is
submitted.
Noted this patch also rename tg_with_in_iops/bps_limit() to
tg_within_iops/bps_limit().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220829022240.3348319-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Mon, 29 Aug 2022 02:22:38 +0000 (10:22 +0800)]
blk-throttle: prevent overflow while calculating wait time
There is a problem found by code review in tg_with_in_bps_limit() that
'bps_limit * jiffy_elapsed_rnd' might overflow. Fix the problem by
calling mul_u64_u64_div_u64() instead.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220829022240.3348319-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Mon, 29 Aug 2022 02:22:37 +0000 (10:22 +0800)]
blk-throttle: fix that io throttle can only work for single bio
Test scripts:
cd /sys/fs/cgroup/blkio/
echo "8:0 1024" > blkio.throttle.write_bps_device
echo $$ > cgroup.procs
dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct &
dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct &
Test result:
10240 bytes (10 kB, 10 KiB) copied, 10.0134 s, 1.0 kB/s
10240 bytes (10 kB, 10 KiB) copied, 10.0135 s, 1.0 kB/s
The problem is that the second bio is finished after 10s instead of 20s.
Root cause:
1) second bio will be flagged:
__blk_throtl_bio
while (true) {
...
if (sq->nr_queued[rw]) -> some bio is throttled already
break
};
bio_set_flag(bio, BIO_THROTTLED); -> flag the bio
2) flagged bio will be dispatched without waiting:
throtl_dispatch_tg
tg_may_dispatch
tg_with_in_bps_limit
if (bps_limit == U64_MAX || bio_flagged(bio, BIO_THROTTLED))
*wait = 0; -> wait time is zero
return true;
commit
9f5ede3c01f9 ("block: throttle split bio in case of iops limit")
support to count split bios for iops limit, thus it adds flagged bio
checking in tg_with_in_bps_limit() so that split bios will only count
once for bps limit, however, it introduce a new problem that io throttle
won't work if multiple bios are throttled.
In order to fix the problem, handle iops/bps limit in different ways:
1) for iops limit, there is no flag to record if the bio is throttled,
and iops is always applied.
2) for bps limit, original bio will be flagged with BIO_BPS_THROTTLED,
and io throttle will ignore bio with the flag.
Noted this patch also remove the code to set flag in __bio_clone(), it's
introduced in commit
111be8839817 ("block-throttle: avoid double
charge"), and author thinks split bio can be resubmited and throttled
again, which is wrong because split bio will continue to dispatch from
caller.
Fixes: 9f5ede3c01f9 ("block: throttle split bio in case of iops limit")
Cc: <stable@vger.kernel.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220829022240.3348319-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Fri, 9 Sep 2022 18:40:22 +0000 (11:40 -0700)]
sbitmap: fix batched wait_cnt accounting
Batched completions can clear multiple bits, but we're only decrementing
the wait_cnt by one each time. This can cause waiters to never be woken,
stalling IO. Use the batched count instead.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215679
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20220909184022.1709476-1-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Uros Bizjak [Thu, 8 Sep 2022 15:12:00 +0000 (17:12 +0200)]
sbitmap: Use atomic_long_try_cmpxchg in __sbitmap_queue_get_batch
Use atomic_long_try_cmpxchg instead of
atomic_long_cmpxchg (*ptr, old, new) == old in __sbitmap_queue_get_batch.
x86 CMPXCHG instruction returns success in ZF flag, so this change
saves a compare after cmpxchg (and related move instruction in front
of cmpxchg).
Also, atomic_long_cmpxchg implicitly assigns old *ptr value to "old"
when cmpxchg fails, enabling further code simplifications, e.g.
an extra memory read can be avoided in the loop.
No functional change intended.
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20220908151200.9993-1-ubizjak@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Shigeru Yoshida [Wed, 7 Sep 2022 16:35:02 +0000 (01:35 +0900)]
nbd: Fix hung when signal interrupts nbd_start_device_ioctl()
syzbot reported hung task [1]. The following program is a simplified
version of the reproducer:
int main(void)
{
int sv[2], fd;
if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
return 1;
if ((fd = open("/dev/nbd0", 0)) < 0)
return 1;
if (ioctl(fd, NBD_SET_SIZE_BLOCKS, 0x81) < 0)
return 1;
if (ioctl(fd, NBD_SET_SOCK, sv[0]) < 0)
return 1;
if (ioctl(fd, NBD_DO_IT) < 0)
return 1;
return 0;
}
When signal interrupt nbd_start_device_ioctl() waiting the condition
atomic_read(&config->recv_threads) == 0, the task can hung because it
waits the completion of the inflight IOs.
This patch fixes the issue by clearing queue, not just shutdown, when
signal interrupt nbd_start_device_ioctl().
Link: https://syzkaller.appspot.com/bug?id=7d89a3ffacd2b83fdd39549bc4d8e0a89ef21239
Reported-by: syzbot+38e6c55d4969a14c1534@syzkaller.appspotmail.com
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20220907163502.577561-1-syoshida@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 8 Sep 2022 13:09:37 +0000 (15:09 +0200)]
sbitmap: Avoid leaving waitqueue in invalid state in __sbq_wake_up()
When __sbq_wake_up() decrements wait_cnt to 0 but races with someone
else waking the waiter on the waitqueue (so the waitqueue becomes
empty), it exits without reseting wait_cnt to wake_batch number. Once
wait_cnt is 0, nobody will ever reset the wait_cnt or wake the new
waiters resulting in possible deadlocks or busyloops. Fix the problem by
making sure we reset wait_cnt even if we didn't wake up anybody in the
end.
Fixes: 040b83fcecfb ("sbitmap: fix possible io hung due to lost wakeup")
Reported-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220908130937.2795-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Guoqing Jiang [Fri, 2 Sep 2022 10:00:55 +0000 (18:00 +0800)]
rnbd-srv: remove redundant setting of blk_open_flags
It is not necessary since it is set later just before function
return success.
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Acked-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://lore.kernel.org/r/20220902100055.25724-4-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Guoqing Jiang [Fri, 2 Sep 2022 10:00:54 +0000 (18:00 +0800)]
rnbd-srv: make process_msg_close returns void
Change the return type to void given it always returns 0.
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Acked-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://lore.kernel.org/r/20220902100055.25724-3-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Guoqing Jiang [Fri, 2 Sep 2022 10:00:53 +0000 (18:00 +0800)]
rnbd-srv: add comment in rnbd_srv_rdma_ev
Let's add some explanations here given the err handling is not obvious.
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Acked-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://lore.kernel.org/r/20220902100055.25724-2-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Miaohe Lin [Mon, 5 Sep 2022 10:27:54 +0000 (18:27 +0800)]
block: remove unneeded return value of bio_check_ro()
bio_check_ro() always return false now. Remove this unneeded return value
and cleanup the sole caller. No functional change intended.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Link: https://lore.kernel.org/r/20220905102754.1942-1-linmiaohe@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Miaohe Lin [Mon, 5 Sep 2022 10:19:50 +0000 (18:19 +0800)]
blk-mq: remove unneeded needs_restart check
If code reaches here, needs_restart must be true. Remove this unneeded
needs_restart check. No functional change intended.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Link: https://lore.kernel.org/r/20220905101950.4606-1-linmiaohe@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jiapeng Chong [Mon, 5 Sep 2022 06:32:53 +0000 (14:32 +0800)]
block/blk-map: Remove set but unused variable 'added'
The variable added is not effectively used in the function, so delete
it.
block/blk-map.c:273:16: warning: variable 'added' set but not used.
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2049
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220905063253.120082-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Sat, 3 Sep 2022 06:28:26 +0000 (14:28 +0800)]
blk-throttle: clean up codes that can't be reached
While doing code coverage testing while CONFIG_BLK_DEV_THROTTLING_LOW is
disabled, we found that there are many codes can never be reached.
This patch move such codes inside "#ifdef CONFIG_BLK_DEV_THROTTLING_LOW".
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220903062826.1099085-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sun, 4 Sep 2022 12:39:25 +0000 (06:39 -0600)]
Revert "sbitmap: fix batched wait_cnt accounting"
This reverts commit
16ede66973c84f890c03584f79158dd5b2d725f5.
This is causing issues with CPU stalls on my test box, revert it for
now until we understand what is going on. It looks like infinite
looping off sbitmap_queue_wake_up(), but hard to tell with a lot of
CPUs hitting this issue and the console scrolling infinitely.
Link: https://lore.kernel.org/linux-block/e742813b-ce5c-0d58-205b-1626f639b1bd@kernel.dk/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 2 Sep 2022 16:40:11 +0000 (10:40 -0600)]
block: enable per-cpu bio caching for the fs bio set
This is useful for polled IO on a file, or for polled IO with the
io_uring passthrough mechanism. If bio allocations are done with
REQ_POLLED for those cases, then initializing the bio set with
BIOSET_PERCPU_CACHE enables the local per-cpu cache which eliminates
allocations (and frees) of bio structs when possible.
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Thu, 25 Aug 2022 14:53:12 +0000 (07:53 -0700)]
sbitmap: fix batched wait_cnt accounting
Batched completions can clear multiple bits, but we're only decrementing
the wait_cnt by one each time. This can cause waiters to never be woken,
stalling IO. Use the batched count instead.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215679
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20220825145312.1217900-1-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Liu Song [Fri, 26 Aug 2022 03:14:13 +0000 (11:14 +0800)]
sbitmap: remove unnecessary code in __sbitmap_queue_get_batch
If "nr + nr_tags <= map_depth", then the value of nr_tags will not be
greater than map_depth, so no additional comparison is required.
Signed-off-by: Liu Song <liusong@linux.alibaba.com>
Link: https://lore.kernel.org/r/1661483653-27326-1-git-send-email-liusong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ye xingchen [Wed, 24 Aug 2022 07:52:13 +0000 (07:52 +0000)]
block/rnbd-clt: Remove the unneeded result variable
Return the value from rtrs_clt_rdma_cq_direct() directly instead of
storing it in another redundant variable.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20220824075213.221397-1-ye.xingchen@zte.com.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Wed, 3 Aug 2022 12:15:04 +0000 (20:15 +0800)]
sbitmap: fix possible io hung due to lost wakeup
There are two problems can lead to lost wakeup:
1) invalid wakeup on the wrong waitqueue:
For example, 2 * wake_batch tags are put, while only wake_batch threads
are woken:
__sbq_wake_up
atomic_cmpxchg -> reset wait_cnt
__sbq_wake_up -> decrease wait_cnt
...
__sbq_wake_up -> wait_cnt is decreased to 0 again
atomic_cmpxchg
sbq_index_atomic_inc -> increase wake_index
wake_up_nr -> wake up and waitqueue might be empty
sbq_index_atomic_inc -> increase again, one waitqueue is skipped
wake_up_nr -> invalid wake up because old wakequeue might be empty
To fix the problem, increasing 'wake_index' before resetting 'wait_cnt'.
2) 'wait_cnt' can be decreased while waitqueue is empty
As pointed out by Jan Kara, following race is possible:
CPU1 CPU2
__sbq_wake_up __sbq_wake_up
sbq_wake_ptr() sbq_wake_ptr() -> the same
wait_cnt = atomic_dec_return()
/* decreased to 0 */
sbq_index_atomic_inc()
/* move to next waitqueue */
atomic_set()
/* reset wait_cnt */
wake_up_nr()
/* wake up on the old waitqueue */
wait_cnt = atomic_dec_return()
/*
* decrease wait_cnt in the old
* waitqueue, while it can be
* empty.
*/
Fix the problem by waking up before updating 'wake_index' and
'wait_cnt'.
With this patch, noted that 'wait_cnt' is still decreased in the old
empty waitqueue, however, the wakeup is redirected to a active waitqueue,
and the extra decrement on the old empty waitqueue is not handled.
Fixes: 88459642cba4 ("blk-mq: abstract tag allocation out into sbitmap library")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220803121504.212071-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 5 Aug 2022 22:44:34 +0000 (16:44 -0600)]
block: use on-stack page vec for <= UIO_FASTIOV
Avoid a kmalloc+kfree for each page array, if we only have a few pages
that are mapped. An alloc+free for each IO is quite expensive, and
it's pretty pointless if we're only dealing with 1 or a few vecs.
Use UIO_FASTIOV like we do in other spots to set a sane limit for how
big of an IO we want to avoid allocations for.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 5 Aug 2022 22:43:09 +0000 (16:43 -0600)]
block: enable bio caching use for passthru IO
bdev based polled O_DIRECT is currently quite a bit faster than
passthru on the same device, and one of the reaons is that we're not
able to use the bio caching for passthru IO.
If REQ_POLLED is set on the request, use the fs bio set for grabbing a
bio from the caches, if available. This saves 5-6% of CPU over head
for polled passthru IO.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 5 Aug 2022 22:39:04 +0000 (16:39 -0600)]
block: shrink rq_map_data a bit
We don't need full ints for several of these members. Change the
page_order and nr_entries to unsigned shorts, and the true/false from_user
and null_mapped to booleans.
This shrinks the struct from 32 to 24 bytes on 64-bit archs.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Tue, 16 Aug 2022 01:56:31 +0000 (09:56 +0800)]
block, bfq: remove useless parameter for bfq_add/del_bfqq_busy()
'bfqd' can be accessed through 'bfqq->bfqd', there is no need to pass
it as a parameter separately.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220816015631.1323948-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Tue, 16 Aug 2022 01:56:30 +0000 (09:56 +0800)]
block, bfq: remove useless checking in bfq_put_queue()
'bfqq->bfqd' is ensured to set in bfq_init_queue(), and it will never
change afterwards.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220816015631.1323948-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Tue, 16 Aug 2022 01:56:29 +0000 (09:56 +0800)]
block, bfq: remove unused functions
While doing code coverage testing(CONFIG_BFQ_CGROUP_DEBUG is disabled), we
found that some functions doesn't have caller, thus remove them.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220816015631.1323948-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Mon, 15 Aug 2022 17:00:43 +0000 (10:00 -0700)]
block: Change the return type of blk_mq_map_queues() into void
Since blk_mq_map_queues() and the .map_queues() callbacks always return 0,
change their return type into void. Most callers ignore the returned value
anyway.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: John Garry <john.garry@huawei.com>
Acked-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20220815170043.19489-3-bvanassche@acm.org
[axboe: fold in fix from Bart]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Mon, 15 Aug 2022 17:00:42 +0000 (10:00 -0700)]
null_blk: Modify the behavior of null_map_queues()
Instead of returning -EINVAL if an internal inconsistency is detected,
fall back to a single submission queue. This patch prepares for changing
the return value of the .map_queues() callbacks into void.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220815170043.19489-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Santosh Pradhan [Thu, 18 Aug 2022 10:55:51 +0000 (12:55 +0200)]
block/rnbd-srv: Add event tracing support
Add event tracing mechanism for following routines:
- create_sess()
- destroy_sess()
- process_rdma()
- process_msg_sess_info()
- process_msg_open()
- process_msg_close()
How to use:
1. Load the rnbd_server module
2. cd /sys/kernel/debug/tracing
3. If all the events need to be enabled:
echo 1 > events/rnbd_srv/enable
4. OR only speific routine/event needs to be enabled e.g.
echo 1 > events/rnbd_srv/create_sess/enable
5. cat trace
5. Run some workload which can trigger create_sess() routine/event
Signed-off-by: Santosh Pradhan <santosh.pradhan@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://lore.kernel.org/r/20220818105551.110490-2-haris.iqbal@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
dougmill@linux.vnet.ibm.com [Tue, 16 Aug 2022 14:07:13 +0000 (15:07 +0100)]
block: sed-opal: Add ioctl to return device status
Provide a mechanism to retrieve basic status information about
the device, including the "supported" flag indicating whether
SED-OPAL is supported. The information returned is from the various
feature descriptors received during the discovery0 step, and so
this ioctl does nothing more than perform the discovery0 step
and then save the information received. See "struct opal_status"
and OPAL_FL_* bits for the status information currently returned.
This is necessary to be able to check whether a device is OPAL
enabled, set up, locked or unlocked from userspace programs
like systemd-cryptsetup and libcryptsetup. Right now we just
have to assume the user 'knows' or blindly attempt setup/lock/unlock
operations.
Signed-off-by: Douglas Miller <dougmill@linux.vnet.ibm.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Link: https://lore.kernel.org/r/20220816140713.84893-1-luca.boccassi@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Mon, 22 Aug 2022 00:32:54 +0000 (17:32 -0700)]
Linux 6.0-rc2
Linus Torvalds [Sun, 21 Aug 2022 22:09:55 +0000 (15:09 -0700)]
Merge tag 'irq-urgent-2022-08-21' of git://git./linux/kernel/git/tip/tip
Pull irq fixes from Ingo Molnar:
"Misc irqchip fixes: LoongArch driver fixes and a Hyper-V IOMMU fix"
* tag 'irq-urgent-2022-08-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/loongson-liointc: Fix an error handling path in liointc_init()
irqchip/loongarch: Fix irq_domain_alloc_fwnode() abuse
irqchip/loongson-pch-pic: Move find_pch_pic() into CONFIG_ACPI
irqchip/loongson-eiointc: Fix a build warning
irqchip/loongson-eiointc: Fix irq affinity setting
iommu/hyper-v: Use helper instead of directly accessing affinity
Linus Torvalds [Sun, 21 Aug 2022 22:01:51 +0000 (15:01 -0700)]
Merge tag 'perf-urgent-2022-08-21' of git://git./linux/kernel/git/tip/tip
Pull x86 kprobes fix from Ingo Molnar:
"Fix a kprobes bug in JNG/JNLE emulation when a kprobe is installed at
such instructions, possibly resulting in incorrect execution (the
wrong branch taken)"
* tag 'perf-urgent-2022-08-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kprobes: Fix JNG/JNLE emulation
Linus Torvalds [Sun, 21 Aug 2022 21:49:42 +0000 (14:49 -0700)]
Merge tag 'trace-v6.0-rc1-2' of git://git./linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Various fixes for tracing:
- Fix a return value of traceprobe_parse_event_name()
- Fix NULL pointer dereference from failed ftrace enabling
- Fix NULL pointer dereference when asking for registers from eprobes
- Make eprobes consistent with kprobes/uprobes, filters and
histograms"
* tag 'trace-v6.0-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Have filter accept "common_cpu" to be consistent
tracing/probes: Have kprobes and uprobes use $COMM too
tracing/eprobes: Have event probes be consistent with kprobes and uprobes
tracing/eprobes: Fix reading of string fields
tracing/eprobes: Do not hardcode $comm as a string
tracing/eprobes: Do not allow eprobes to use $stack, or % for regs
ftrace: Fix NULL pointer dereference in is_ftrace_trampoline when ftrace is dead
tracing/perf: Fix double put of trace event when init fails
tracing: React to error return from traceprobe_parse_event_name()
Steven Rostedt (Google) [Sat, 20 Aug 2022 13:43:22 +0000 (09:43 -0400)]
tracing: Have filter accept "common_cpu" to be consistent
Make filtering consistent with histograms. As "cpu" can be a field of an
event, allow for "common_cpu" to keep it from being confused with the
"cpu" field of the event.
Link: https://lkml.kernel.org/r/20220820134401.513062765@goodmis.org
Link: https://lore.kernel.org/all/20220820220920.e42fa32b70505b1904f0a0ad@kernel.org/
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 1e3bac71c5053 ("tracing/histogram: Rename "cpu" to "common_cpu"")
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Steven Rostedt (Google) [Sat, 20 Aug 2022 13:43:21 +0000 (09:43 -0400)]
tracing/probes: Have kprobes and uprobes use $COMM too
Both $comm and $COMM can be used to get current->comm in eprobes and the
filtering and histogram logic. Make kprobes and uprobes consistent in this
regard and allow both $comm and $COMM as well. Currently kprobes and
uprobes only handle $comm, which is inconsistent with the other utilities,
and can be confusing to users.
Link: https://lkml.kernel.org/r/20220820134401.317014913@goodmis.org
Link: https://lore.kernel.org/all/20220820220442.776e1ddaf8836e82edb34d01@kernel.org/
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 533059281ee5 ("tracing: probeevent: Introduce new argument fetching code")
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Steven Rostedt (Google) [Sat, 20 Aug 2022 13:43:20 +0000 (09:43 -0400)]
tracing/eprobes: Have event probes be consistent with kprobes and uprobes
Currently, if a symbol "@" is attempted to be used with an event probe
(eprobes), it will cause a NULL pointer dereference crash.
Both kprobes and uprobes can reference data other than the main registers.
Such as immediate address, symbols and the current task name. Have eprobes
do the same thing.
For "comm", if "comm" is used and the event being attached to does not
have the "comm" field, then make it the "$comm" that kprobes has. This is
consistent to the way histograms and filters work.
Link: https://lkml.kernel.org/r/20220820134401.136924220@goodmis.org
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Steven Rostedt (Google) [Sat, 20 Aug 2022 13:43:19 +0000 (09:43 -0400)]
tracing/eprobes: Fix reading of string fields
Currently when an event probe (eprobe) hooks to a string field, it does
not display it as a string, but instead as a number. This makes the field
rather useless. Handle the different kinds of strings, dynamic, static,
relational/dynamic etc.
Now when a string field is used, the ":string" type can be used to display
it:
echo "e:sw sched/sched_switch comm=$next_comm:string" > dynamic_events
Link: https://lkml.kernel.org/r/20220820134400.959640191@goodmis.org
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Steven Rostedt (Google) [Sat, 20 Aug 2022 13:43:18 +0000 (09:43 -0400)]
tracing/eprobes: Do not hardcode $comm as a string
The variable $comm is hard coded as a string, which is true for both
kprobes and uprobes, but for event probes (eprobes) it is a field name. In
most cases the "comm" field would be a string, but there's no guarantee of
that fact.
Do not assume that comm is a string. Not to mention, it currently forces
comm fields to fault, as string processing for event probes is currently
broken.
Link: https://lkml.kernel.org/r/20220820134400.756152112@goodmis.org
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Steven Rostedt (Google) [Sat, 20 Aug 2022 13:43:17 +0000 (09:43 -0400)]
tracing/eprobes: Do not allow eprobes to use $stack, or % for regs
While playing with event probes (eprobes), I tried to see what would
happen if I attempted to retrieve the instruction pointer (%rip) knowing
that event probes do not use pt_regs. The result was:
BUG: kernel NULL pointer dereference, address:
0000000000000024
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 1 PID: 1847 Comm: trace-cmd Not tainted 5.19.0-rc5-test+ #309
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01
v03.03 07/14/2016
RIP: 0010:get_event_field.isra.0+0x0/0x50
Code: ff 48 c7 c7 c0 8f 74 a1 e8 3d 8b f5 ff e8 88 09 f6 ff 4c 89 e7 e8
50 6a 13 00 48 89 ef 5b 5d 41 5c 41 5d e9 42 6a 13 00 66 90 <48> 63 47 24
8b 57 2c 48 01 c6 8b 47 28 83 f8 02 74 0e 83 f8 04 74
RSP: 0018:
ffff916c394bbaf0 EFLAGS:
00010086
RAX:
ffff916c854041d8 RBX:
ffff916c8d9fbf50 RCX:
ffff916c255d2000
RDX:
0000000000000000 RSI:
ffff916c255d2008 RDI:
0000000000000000
RBP:
0000000000000000 R08:
ffff916c3a2a0c08 R09:
ffff916c394bbda8
R10:
0000000000000000 R11:
0000000000000000 R12:
ffff916c854041d8
R13:
ffff916c854041b0 R14:
0000000000000000 R15:
0000000000000000
FS:
0000000000000000(0000) GS:
ffff916c9ea40000(0000)
knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000000000000024 CR3:
000000011b60a002 CR4:
00000000001706e0
Call Trace:
<TASK>
get_eprobe_size+0xb4/0x640
? __mod_node_page_state+0x72/0xc0
__eprobe_trace_func+0x59/0x1a0
? __mod_lruvec_page_state+0xaa/0x1b0
? page_remove_file_rmap+0x14/0x230
? page_remove_rmap+0xda/0x170
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x18f/0x240
trace_event_raw_event_sched_wakeup_template+0x7a/0xb0
try_to_wake_up+0x260/0x4c0
__wake_up_common+0x80/0x180
__wake_up_common_lock+0x7c/0xc0
do_notify_parent+0x1c9/0x2a0
exit_notify+0x1a9/0x220
do_exit+0x2ba/0x450
do_group_exit+0x2d/0x90
__x64_sys_exit_group+0x14/0x20
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Obviously this is not the desired result.
Move the testing for TPARG_FL_TPOINT which is only used for event probes
to the top of the "$" variable check, as all the other variables are not
used for event probes. Also add a check in the register parsing "%" to
fail if an event probe is used.
Link: https://lkml.kernel.org/r/20220820134400.564426983@goodmis.org
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Yang Jihong [Thu, 18 Aug 2022 03:26:59 +0000 (11:26 +0800)]
ftrace: Fix NULL pointer dereference in is_ftrace_trampoline when ftrace is dead
ftrace_startup does not remove ops from ftrace_ops_list when
ftrace_startup_enable fails:
register_ftrace_function
ftrace_startup
__register_ftrace_function
...
add_ftrace_ops(&ftrace_ops_list, ops)
...
...
ftrace_startup_enable // if ftrace failed to modify, ftrace_disabled is set to 1
...
return 0 // ops is in the ftrace_ops_list.
When ftrace_disabled = 1, unregister_ftrace_function simply returns without doing anything:
unregister_ftrace_function
ftrace_shutdown
if (unlikely(ftrace_disabled))
return -ENODEV; // return here, __unregister_ftrace_function is not executed,
// as a result, ops is still in the ftrace_ops_list
__unregister_ftrace_function
...
If ops is dynamically allocated, it will be free later, in this case,
is_ftrace_trampoline accesses NULL pointer:
is_ftrace_trampoline
ftrace_ops_trampoline
do_for_each_ftrace_op(op, ftrace_ops_list) // OOPS! op may be NULL!
Syzkaller reports as follows:
[ 1203.506103] BUG: kernel NULL pointer dereference, address:
000000000000010b
[ 1203.508039] #PF: supervisor read access in kernel mode
[ 1203.508798] #PF: error_code(0x0000) - not-present page
[ 1203.509558] PGD
800000011660b067 P4D
800000011660b067 PUD
130fb8067 PMD 0
[ 1203.510560] Oops: 0000 [#1] SMP KASAN PTI
[ 1203.511189] CPU: 6 PID: 29532 Comm: syz-executor.2 Tainted: G B W 5.10.0 #8
[ 1203.512324] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 1203.513895] RIP: 0010:is_ftrace_trampoline+0x26/0xb0
[ 1203.514644] Code: ff eb d3 90 41 55 41 54 49 89 fc 55 53 e8 f2 00 fd ff 48 8b 1d 3b 35 5d 03 e8 e6 00 fd ff 48 8d bb 90 00 00 00 e8 2a 81 26 00 <48> 8b ab 90 00 00 00 48 85 ed 74 1d e8 c9 00 fd ff 48 8d bb 98 00
[ 1203.518838] RSP: 0018:
ffffc900012cf960 EFLAGS:
00010246
[ 1203.520092] RAX:
0000000000000000 RBX:
000000000000007b RCX:
ffffffff8a331866
[ 1203.521469] RDX:
0000000000000000 RSI:
0000000000000008 RDI:
000000000000010b
[ 1203.522583] RBP:
0000000000000000 R08:
0000000000000000 R09:
ffffffff8df18b07
[ 1203.523550] R10:
fffffbfff1be3160 R11:
0000000000000001 R12:
0000000000478399
[ 1203.524596] R13:
0000000000000000 R14:
ffff888145088000 R15:
0000000000000008
[ 1203.525634] FS:
00007f429f5f4700(0000) GS:
ffff8881daf00000(0000) knlGS:
0000000000000000
[ 1203.526801] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 1203.527626] CR2:
000000000000010b CR3:
0000000170e1e001 CR4:
00000000003706e0
[ 1203.528611] DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
[ 1203.529605] DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
Therefore, when ftrace_startup_enable fails, we need to rollback registration
process and remove ops from ftrace_ops_list.
Link: https://lkml.kernel.org/r/20220818032659.56209-1-yangjihong1@huawei.com
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Steven Rostedt (Google) [Tue, 16 Aug 2022 23:28:17 +0000 (19:28 -0400)]
tracing/perf: Fix double put of trace event when init fails
If in perf_trace_event_init(), the perf_trace_event_open() fails, then it
will call perf_trace_event_unreg() which will not only unregister the perf
trace event, but will also call the put() function of the tp_event.
The problem here is that the trace_event_try_get_ref() is called by the
caller of perf_trace_event_init() and if perf_trace_event_init() returns a
failure, it will then call trace_event_put(). But since the
perf_trace_event_unreg() already called the trace_event_put() function, it
triggers a WARN_ON().
WARNING: CPU: 1 PID: 30309 at kernel/trace/trace_dynevent.c:46 trace_event_dyn_put_ref+0x15/0x20
If perf_trace_event_reg() does not call the trace_event_try_get_ref() then
the perf_trace_event_unreg() should not be calling trace_event_put(). This
breaks symmetry and causes bugs like these.
Pull out the trace_event_put() from perf_trace_event_unreg() and call it
in the locations that perf_trace_event_unreg() is called. This not only
fixes this bug, but also brings back the proper symmetry of the reg/unreg
vs get/put logic.
Link: https://lore.kernel.org/all/cover.1660347763.git.kjlx@templeofstupid.com/
Link: https://lkml.kernel.org/r/20220816192817.43d5e17f@gandalf.local.home
Cc: stable@vger.kernel.org
Fixes: 1d18538e6a092 ("tracing: Have dynamic events have a ref counter")
Reported-by: Krister Johansen <kjlx@templeofstupid.com>
Reviewed-by: Krister Johansen <kjlx@templeofstupid.com>
Tested-by: Krister Johansen <kjlx@templeofstupid.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Lukas Bulwahn [Thu, 11 Aug 2022 07:17:34 +0000 (09:17 +0200)]
tracing: React to error return from traceprobe_parse_event_name()
The function traceprobe_parse_event_name() may set the first two function
arguments to a non-null value and still return -EINVAL to indicate an
unsuccessful completion of the function. Hence, it is not sufficient to
just check the result of the two function arguments for being not null,
but the return value also needs to be checked.
Commit
95c104c378dc ("tracing: Auto generate event name when creating a
group of events") changed the error-return-value checking of the second
traceprobe_parse_event_name() invocation in __trace_eprobe_create() and
removed checking the return value to jump to the error handling case.
Reinstate using the return value in the error-return-value checking.
Link: https://lkml.kernel.org/r/20220811071734.20700-1-lukas.bulwahn@gmail.com
Fixes: 95c104c378dc ("tracing: Auto generate event name when creating a group of events")
Acked-by: Linyu Yuan <quic_linyyuan@quicinc.com>
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Linus Torvalds [Sun, 21 Aug 2022 18:18:33 +0000 (11:18 -0700)]
Merge tag 'i2c-for-6.0-rc2' of git://git./linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"A revert to fix a regression introduced this merge window and a fix
for proper error handling in the remove path of the iMX driver"
* tag 'i2c-for-6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: imx: Make sure to unregister adapter on remove()
Revert "i2c: scmi: Replace open coded device_get_match_data()"
Linus Torvalds [Sun, 21 Aug 2022 17:21:16 +0000 (10:21 -0700)]
Merge tag '6.0-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs client fixes from Steve French:
- memory leak fix
- two small cleanups
- trivial strlcpy removal
- update missing entry for cifs headers in MAINTAINERS file
* tag '6.0-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: move from strlcpy with unused retval to strscpy
cifs: Fix memory leak on the deferred close
cifs: remove useless parameter 'is_fsctl' from SMB2_ioctl()
cifs: remove unused server parameter from calc_smb_size()
cifs: missing directory in MAINTAINERS file
Nick Desaulniers [Fri, 19 Aug 2022 19:06:40 +0000 (12:06 -0700)]
asm goto: eradicate CC_HAS_ASM_GOTO
GCC has supported asm goto since 4.5, and Clang has since version 9.0.0.
The minimum supported versions of these tools for the build according to
Documentation/process/changes.rst are 5.1 and 11.0.0 respectively.
Remove the feature detection script, Kconfig option, and clean up some
fallback code that is no longer supported.
The removed script was also testing for a GCC specific bug that was
fixed in the 4.7 release.
Also remove workarounds for bpftrace using clang older than 9.0.0, since
other BPF backend fixes are required at this point.
Link: https://lore.kernel.org/lkml/CAK7LNATSr=BXKfkdW8f-H5VT_w=xBpT2ZQcZ7rm6JfkdE+QnmA@mail.gmail.com/
Link: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48637
Acked-by: Borislav Petkov <bp@suse.de>
Suggested-by: Masahiro Yamada <masahiroy@kernel.org>
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Uwe Kleine-König [Wed, 20 Jul 2022 15:09:33 +0000 (17:09 +0200)]
i2c: imx: Make sure to unregister adapter on remove()
If for whatever reasons pm_runtime_resume_and_get() fails and .remove() is
exited early, the i2c adapter stays around and the irq still calls its
handler, while the driver data and the register mapping go away. So if
later the i2c adapter is accessed or the irq triggers this results in
havoc accessing freed memory and unmapped registers.
So unregister the software resources even if resume failed, and only skip
the hardware access in that case.
Fixes: 588eb93ea49f ("i2c: imx: add runtime pm support to improve the performance")
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
Wolfram Sang [Thu, 18 Aug 2022 20:31:13 +0000 (22:31 +0200)]
Revert "i2c: scmi: Replace open coded device_get_match_data()"
This reverts commit
9ae551ded5ba55f96a83cd0811f7ef8c2f329d0c. We got a
regression report, so ensure this machine boots again. We will come back
with a better version hopefully.
Reported-by: Josef Johansson <josef@oderland.se>
Link: https://lore.kernel.org/r/4d2d5b04-0b6c-1cb1-a63f-dc06dfe1b5da@oderland.se
Signed-off-by: Wolfram Sang <wsa@kernel.org>
Linus Torvalds [Sat, 20 Aug 2022 21:55:38 +0000 (14:55 -0700)]
Merge tag 'kbuild-fixes-v6.0' of git://git./linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Fix module versioning broken on some architectures
- Make dummy-tools enable CONFIG_PPC_LONG_DOUBLE_128
- Remove -Wformat-zero-length, which has no warning instance
- Fix the order between drivers and libs in modules.order
- Fix false-positive warnings in clang-analyzer
* tag 'kbuild-fixes-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
scripts/clang-tools: Remove DeprecatedOrUnsafeBufferHandling check
kbuild: fix the modules order between drivers and libs
scripts/Makefile.extrawarn: Do not disable clang's -Wformat-zero-length
kbuild: dummy-tools: pretend we understand __LONG_DOUBLE_128__
modpost: fix module versioning when a symbol lacks valid CRC
Linus Torvalds [Sat, 20 Aug 2022 21:46:05 +0000 (14:46 -0700)]
Merge tag 'perf-tools-fixes-for-v6.0-2022-08-19' of git://git./linux/kernel/git/acme/linux
Pull perf tools fixes from Arnaldo Carvalho de Melo:
- Fix alignment for cpu map masks in event encoding.
- Support reading PERF_FORMAT_LOST, perf tool counterpart for a feature
that was added in this merge window.
- Sync perf tools copies of kernel headers: socket, msr-index, fscrypt,
cpufeatures, i915_drm, kvm, vhost, perf_event.
* tag 'perf-tools-fixes-for-v6.0-2022-08-19' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
perf tools: Support reading PERF_FORMAT_LOST
libperf: Add a test case for read formats
libperf: Handle read format in perf_evsel__read()
tools headers UAPI: Sync linux/perf_event.h with the kernel sources
tools headers UAPI: Sync x86's asm/kvm.h with the kernel sources
tools headers UAPI: Sync KVM's vmx.h header with the kernel sources
tools include UAPI: Sync linux/vhost.h with the kernel sources
tools headers kvm s390: Sync headers with the kernel sources
tools headers UAPI: Sync linux/kvm.h with the kernel sources
tools headers UAPI: Sync drm/i915_drm.h with the kernel sources
tools headers cpufeatures: Sync with the kernel sources
tools headers UAPI: Sync linux/fscrypt.h with the kernel sources
tools arch x86: Sync the msr-index.h copy with the kernel sources
perf beauty: Update copy of linux/socket.h with the kernel sources
perf cpumap: Fix alignment for masks in event encoding
perf cpumap: Compute mask size in constant time
perf cpumap: Synthetic events and const/static
perf cpumap: Const map for max()
Linus Torvalds [Sat, 20 Aug 2022 18:29:01 +0000 (11:29 -0700)]
Merge tag 's390-6.0-1' of git://git./linux/kernel/git/s390/linux
Pull s390 updates from Alexander Gordeev:
- Fix a KVM crash on z12 and older machines caused by a wrong
assumption that Query AP Configuration Information is always
available.
- Lower severity of excessive Hypervisor filesystem error messages
when booting under KVM.
* tag 's390-6.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/ap: fix crash on older machines based on QCI info missing
s390/hypfs: avoid error message under KVM
Linus Torvalds [Sat, 20 Aug 2022 18:20:37 +0000 (11:20 -0700)]
Merge tag 'powerpc-6.0-3' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix atomic sleep warnings at boot due to get_phb_number() taking a
mutex with a spinlock held on some machines.
- Add missing PMU selftests to .gitignores.
Thanks to Guenter Roeck and Russell Currey.
* tag 'powerpc-6.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
selftests/powerpc: Add missing PMU selftests to .gitignores
powerpc/pci: Fix get_phb_number() locking
Linus Torvalds [Sat, 20 Aug 2022 17:49:02 +0000 (10:49 -0700)]
Merge tag 'for-linus' of git://git./linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"A few minor fixes:
- Fix buffer management in SRP to correct a regression with the login
authentication feature from v5.17
- Don't iterate over non-present ports in mlx5
- Fix an error introduced by the foritify work in cxgb4
- Two bug fixes for the recently merged ERDMA driver
- Unbreak RDMA dmabuf support, a regresion from v5.19"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA: Handle the return code from dma_resv_wait_timeout() properly
RDMA/erdma: Correct the max_qp and max_cq capacities of the device
RDMA/erdma: Using the key in FMR WR instead of MR structure
RDMA/cxgb4: fix accept failure due to increased cpl_t5_pass_accept_rpl size
RDMA/mlx5: Use the proper number of ports
IB/iser: Fix login with authentication
Guru Das Srinagesh [Thu, 4 Aug 2022 17:46:14 +0000 (10:46 -0700)]
scripts/clang-tools: Remove DeprecatedOrUnsafeBufferHandling check
This `clang-analyzer` check flags the use of memset(), suggesting a more
secure version of the API, such as memset_s(), which does not exist in
the kernel:
warning: Call to function 'memset' is insecure as it does not provide
security checks introduced in the C11 standard. Replace with analogous
functions that support length arguments or provides boundary checks such
as 'memset_s' in case of C11
[clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling]
Signed-off-by: Guru Das Srinagesh <quic_gurus@quicinc.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Masahiro Yamada [Sat, 13 Aug 2022 23:09:28 +0000 (08:09 +0900)]
kbuild: fix the modules order between drivers and libs
Commit
b2c885549122 ("kbuild: update modules.order only when contained
modules are updated") accidentally changed the modules order.
Prior to that commit, the modules order was determined based on
vmlinux-dirs, which lists core-y/m, drivers-y/m, libs-y/m, in this order.
Now, subdir-modorder lists them in a different order: core-y/m, libs-y/m,
drivers-y/m.
Presumably, there was no practical issue because the modules in drivers
and libs are orthogonal, but there is no reason to have this distortion.
Get back to the original order.
Fixes: b2c885549122 ("kbuild: update modules.order only when contained modules are updated")
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Nathan Chancellor [Wed, 10 Aug 2022 23:01:33 +0000 (16:01 -0700)]
scripts/Makefile.extrawarn: Do not disable clang's -Wformat-zero-length
There are no instances of this warning in the tree across several
difference architectures and configurations. This was added by
commit
26ea6bb1fef0 ("kbuild, LLVMLinux: Supress warnings unless W=1-3")
back in 2014, where it might have been necessary, but there are no
instances of it now so stop disabling it to increase warning coverage
for clang.
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Jiri Slaby [Wed, 10 Aug 2022 09:26:03 +0000 (11:26 +0200)]
kbuild: dummy-tools: pretend we understand __LONG_DOUBLE_128__
There is a test in powerpc's Kconfig which checks __LONG_DOUBLE_128__
and sets CONFIG_PPC_LONG_DOUBLE_128 if it is understood by the compiler.
We currently don't handle it, so this results in PPC_LONG_DOUBLE_128 not
being in super-config generated by dummy-tools. So take this into
account in the gcc script and preprocess __LONG_DOUBLE_128__ as "1".
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Masahiro Yamada [Tue, 9 Aug 2022 14:11:17 +0000 (23:11 +0900)]
modpost: fix module versioning when a symbol lacks valid CRC
Since commit
7b4537199a4a ("kbuild: link symbol CRCs at final link,
removing CONFIG_MODULE_REL_CRCS"), module versioning is broken on
some architectures. Loading a module fails with "disagrees about
version of symbol module_layout".
On such architectures (e.g. ARCH=sparc build with sparc64_defconfig),
modpost shows a warning, like follows:
WARNING: modpost: EXPORT symbol "_mcount" [vmlinux] version generation failed, symbol will not be versioned.
Is "_mcount" prototyped in <asm/asm-prototypes.h>?
Previously, it was a harmless warning (CRC check was just skipped),
but now wrong CRCs are used for comparison because invalid CRCs are
just skipped.
$ sparc64-linux-gnu-nm -n vmlinux
[snip]
0000000000c2cea0 r __ksymtab__kstrtol
0000000000c2ceb8 r __ksymtab__kstrtoul
0000000000c2ced0 r __ksymtab__local_bh_enable
0000000000c2cee8 r __ksymtab__mcount
0000000000c2cf00 r __ksymtab__printk
0000000000c2cf18 r __ksymtab__raw_read_lock
0000000000c2cf30 r __ksymtab__raw_read_lock_bh
[snip]
0000000000c53b34 D __crc__kstrtol
0000000000c53b38 D __crc__kstrtoul
0000000000c53b3c D __crc__local_bh_enable
0000000000c53b40 D __crc__printk
0000000000c53b44 D __crc__raw_read_lock
0000000000c53b48 D __crc__raw_read_lock_bh
Please notice __crc__mcount is missing here.
When the module subsystem looks up a CRC that comes after, it results
in reading out a wrong address. For example, when __crc__printk is
needed, the module subsystem reads 0xc53b44 instead of 0xc53b40.
All CRC entries must be output for correct index accessing. Invalid
CRCs will be unused, but are needed to keep the one-to-one mapping
between __ksymtab_* and __crc_*.
The best is to fix all modpost warnings, but several warnings are still
remaining on less popular architectures.
Fixes: 7b4537199a4a ("kbuild: link symbol CRCs at final link, removing CONFIG_MODULE_REL_CRCS")
Reported-by: matoro <matoro_mailinglist_kernel@matoro.tk>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Tested-by: matoro <matoro_mailinglist_kernel@matoro.tk>
Linus Torvalds [Sat, 20 Aug 2022 17:17:05 +0000 (10:17 -0700)]
Merge tag 'block-6.0-2022-08-19' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A few fixes that should go into this release:
- Small series of patches for ublk (ZiyangZhang)
- Remove dead function (Yu)
- Fix for running a block queue in case of resource starvation
(Yufen)"
* tag 'block-6.0-2022-08-19' of git://git.kernel.dk/linux-block:
blk-mq: run queue no matter whether the request is the last request
blk-mq: remove unused function blk_mq_queue_stopped()
ublk_drv: do not add a re-issued request aborted previously to ioucmd's task_work
ublk_drv: update comment for __ublk_fail_req()
ublk_drv: check ubq_daemon_is_dying() in __ublk_rq_task_work()
ublk_drv: update iod->addr for UBLK_IO_NEED_GET_DATA
Linus Torvalds [Sat, 20 Aug 2022 16:49:22 +0000 (09:49 -0700)]
Merge tag 'io_uring-6.0-2022-08-19' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"A few fixes for regressions in this cycle:
- Two instances of using the wrong "has async data" helper (Pavel)
- Fixup zero-copy address import (Pavel)
- Bump zero-copy notification slot limit (Pavel)"
* tag 'io_uring-6.0-2022-08-19' of git://git.kernel.dk/linux-block:
io_uring/net: use right helpers for async_data
io_uring/notif: raise limit on notification slots
io_uring/net: improve zc addr import error handling
io_uring/net: use right helpers for async recycle
Linus Torvalds [Sat, 20 Aug 2022 16:43:45 +0000 (09:43 -0700)]
Merge tag 'ata-6.0-rc2' of git://git./linux/kernel/git/dlemoal/libata
Pull ATA fixes from Damien Le Moal:
- Add a missing command name definition for ata_get_cmd_name(), from
me.
- A fix to address a performance regression due to the default
max_sectors queue limit for ATA devices connected to AHCI adapters
being too small, from John.
* tag 'ata-6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
ata: libata: Set __ATA_BASE_SHT max_sectors
ata: libata-eh: Add missing command name
Linus Torvalds [Sat, 20 Aug 2022 16:39:00 +0000 (09:39 -0700)]
Merge tag 'mmc-v6.0-rc1' of git://git./linux/kernel/git/ulfh/mmc
Pull MMC host fixes from Ulf Hansson:
- meson-gx: Fix error handling in ->probe()
- mtk-sd: Fix a command problem when using cqe off/disable
- pxamci: Fix error handling in ->probe()
- sdhci-of-dwcmshc: Fix broken support for the BlueField-3 variant
* tag 'mmc-v6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: sdhci-of-dwcmshc: Re-enable support for the BlueField-3 SoC
mmc: meson-gx: Fix an error handling path in meson_mmc_probe()
mmc: mtk-sd: Clear interrupts when cqe off/disable
mmc: pxamci: Fix another error handling path in pxamci_probe()
mmc: pxamci: Fix an error handling path in pxamci_probe()
John Garry [Wed, 17 Aug 2022 15:20:08 +0000 (23:20 +0800)]
ata: libata: Set __ATA_BASE_SHT max_sectors
Commit
0568e6122574 ("ata: libata-scsi: cap ata_device->max_sectors
according to shost->max_sectors") inadvertently capped the max_sectors
value for some SATA disks to a value which is lower than we would want.
For a device which supports LBA48, we would previously have request queue
max_sectors_kb and max_hw_sectors_kb values of 1280 and 32767 respectively.
For AHCI controllers, the value chosen for shost max sectors comes from
the minimum of the SCSI host default max sectors in
SCSI_DEFAULT_MAX_SECTORS (1024) and the shost DMA device mapping limit.
This means that we would now set the max_sectors_kb and max_hw_sectors_kb
values for a disk which supports LBA48 at 512, ignoring DMA mapping limit.
As report by Oliver at [0], this caused a performance regression.
Fix by picking a large enough max sectors value for ATA host controllers
such that we don't needlessly reduce max_sectors_kb for LBA48 disks.
[0] https://lore.kernel.org/linux-ide/YvsGbidf3na5FpGb@xsang-OptiPlex-9020/T/#m22d9fc5ad15af66066dd9fecf3d50f1b1ef11da3
Fixes: 0568e6122574 ("ata: libata-scsi: cap ata_device->max_sectors according to shost->max_sectors")
Reported-by: Oliver Sang <oliver.sang@intel.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Linus Torvalds [Fri, 19 Aug 2022 21:02:24 +0000 (14:02 -0700)]
Merge tag 'execve-v6.0-rc2' of git://git./linux/kernel/git/kees/linux
Pull execve fix from Kees Cook:
- Replace remaining kmap() uses with kmap_local_page() (Fabio M. De
Francesco)
* tag 'execve-v6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
exec: Replace kmap{,_atomic}() with kmap_local_page()
Linus Torvalds [Fri, 19 Aug 2022 20:56:14 +0000 (13:56 -0700)]
Merge tag 'hardening-v6.0-rc2' of git://git./linux/kernel/git/kees/linux
Pull hardening fixes from Kees Cook:
- Also undef LATENT_ENTROPY_PLUGIN for per-file disabling (Andrew
Donnellan)
- Return EFAULT on copy_from_user() failures in LoadPin (Kees Cook)
* tag 'hardening-v6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
gcc-plugins: Undefine LATENT_ENTROPY_PLUGIN when plugin disabled for a file
LoadPin: Return EFAULT on copy_from_user() failures
Linus Torvalds [Fri, 19 Aug 2022 20:49:07 +0000 (13:49 -0700)]
Merge tag 'riscv-for-linus-6.0-rc2' of git://git./linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- A fix to make the ISA extension static keys writable after init. This
manifests at least as a crash when loading modules (including KVM).
- A fixup for a build warning related to a poorly formed comment in our
perf driver.
* tag 'riscv-for-linus-6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
perf: riscv legacy: fix kerneldoc comment warning
riscv: Ensure isa-ext static keys are writable
Linus Torvalds [Fri, 19 Aug 2022 20:40:11 +0000 (13:40 -0700)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Fix unexpected sign extension of KVM_ARM_DEVICE_ID_MASK
- Tidy-up handling of AArch32 on asymmetric systems
x86:
- Fix 'missing ENDBR' BUG for fastop functions
Generic:
- Some cleanup and static analyzer patches
- More fixes to KVM_CREATE_VM unwind paths"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: Drop unnecessary initialization of "ops" in kvm_ioctl_create_device()
KVM: Drop unnecessary initialization of "npages" in hva_to_pfn_slow()
x86/kvm: Fix "missing ENDBR" BUG for fastop functions
x86/kvm: Simplify FOP_SETCC()
x86/ibt, objtool: Add IBT_NOSEAL()
KVM: Rename mmu_notifier_* to mmu_invalidate_*
KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS
KVM: MIPS: remove unnecessary definition of KVM_PRIVATE_MEM_SLOTS
KVM: Move coalesced MMIO initialization (back) into kvm_create_vm()
KVM: Unconditionally get a ref to /dev/kvm module when creating a VM
KVM: Properly unwind VM creation if creating debugfs fails
KVM: arm64: Reject 32bit user PSTATE on asymmetric systems
KVM: arm64: Treat PMCR_EL1.LC as RES1 on asymmetric systems
KVM: arm64: Fix compile error due to sign extension
Linus Torvalds [Fri, 19 Aug 2022 20:33:48 +0000 (13:33 -0700)]
Merge tag 'for-6.0-rc1-tag' of git://git./linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few short fixes and a lockdep warning fix (needs moving some code):
- tree-log replay fixes:
- fix error handling when looking up extent refs
- fix warning when setting inode number of links
- relocation fixes:
- reset block group read-only status when relocation fails
- unset control structure if transaction fails when starting
to process a block group
- add lockdep annotations to fix a warning during relocation
where blocks temporarily belong to another tree and can lead
to reversed dependencies
- tree-checker verifies that extent items don't overlap"
* tag 'for-6.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: tree-checker: check for overlapping extent items
btrfs: fix warning during log replay when bumping inode link count
btrfs: fix lost error handling when looking up extended ref on log replay
btrfs: fix lockdep splat with reloc root extent buffers
btrfs: move lockdep class helpers to locking.c
btrfs: unset reloc control if transaction commit fails in prepare_to_relocate()
btrfs: reset RO counter on block group if we fail to relocate
Linus Torvalds [Fri, 19 Aug 2022 20:26:52 +0000 (13:26 -0700)]
Merge tag '5.20-rc2-ksmbd-smb3-server-fixes' of git://git.samba.org/ksmbd
Pull ksmbd server fixes from Steve French:
- important sparse file fix
- allocation size fix
- fix incorrect rc on bad share
- share config fix
* tag '5.20-rc2-ksmbd-smb3-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: don't remove dos attribute xattr on O_TRUNC open
ksmbd: remove unnecessary generic_fillattr in smb2_open
ksmbd: request update to stale share config
ksmbd: return STATUS_BAD_NETWORK_NAME error status if share is not configured
Namhyung Kim [Fri, 19 Aug 2022 00:36:44 +0000 (17:36 -0700)]
perf tools: Support reading PERF_FORMAT_LOST
The recent kernel added lost count can be read from either read(2) or
ring buffer data with PERF_SAMPLE_READ. As it's a variable length data
we need to access it according to the format info.
But for perf tools use cases, PERF_FORMAT_ID is always set. So we can
only check PERF_FORMAT_LOST bit to determine the data format.
Add sample_read_value_size() and next_sample_read_value() helpers to
make it a bit easier to access. Use them in all places where it reads
the struct sample_read_value.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220819003644.508916-5-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Fri, 19 Aug 2022 00:36:43 +0000 (17:36 -0700)]
libperf: Add a test case for read formats
It checks a various combination of the read format settings and verify
it return the value in a proper position. The test uses task-clock
software events to guarantee it's always active and sets enabled/running
time.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220819003644.508916-4-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Fri, 19 Aug 2022 00:36:42 +0000 (17:36 -0700)]
libperf: Handle read format in perf_evsel__read()
The perf_counts_values should be increased to read the new lost data.
Also adjust values after read according the read format.
This supports PERF_FORMAT_GROUP which has a different data format but
it's only available for leader events. Currently it doesn't have an API
to read sibling (member) events in the group. But users may read the
sibling event directly.
Also reading from mmap would be disabled when the read format has ID or
LOST bit as it's not exposed via mmap.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220819003644.508916-3-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>