Yu Kuai [Wed, 2 Nov 2022 02:25:40 +0000 (10:25 +0800)]
block, bfq: cleanup bfq_activate_requeue_entity()
Just make the code a little cleaner by removing the unnecessary
variable 'sd'.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@unimore.it>
Link: https://lore.kernel.org/r/20221102022542.3621219-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Wed, 2 Nov 2022 02:25:39 +0000 (10:25 +0800)]
block, bfq: factor out code to update 'active_entities'
Current code is a bit ugly and hard to read.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@unimore.it>
Link: https://lore.kernel.org/r/20221102022542.3621219-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Wed, 2 Nov 2022 02:25:38 +0000 (10:25 +0800)]
block, bfq: remove set but not used variable in __bfq_entity_update_weight_prio
After commit afdba1461262 ("block, bfq: cleanup bfq_weights_tree
add/remove apis"), the local variable 'bfqd' is no longer used, so
remove it.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20221102022542.3621219-2-yukuai1@huaweicloud.com
Fixes: afdba1461262 ("block, bfq: cleanup bfq_weights_tree add/remove apis")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Tue, 18 Oct 2022 11:12:40 +0000 (19:12 +0800)]
block: Replace struct rq_depth with unsigned int in struct iolatency_grp
We only need the max queue depth of each iolatency_grp to limit the
number of in-flight I/Os. Replace struct rq_depth with unsigned int to
simplify "struct iolatency_grp" and save memory.
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20221018111240.22612-4-shikemeng@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Tue, 18 Oct 2022 11:12:39 +0000 (19:12 +0800)]
block: Correct comment for scale_cookie_change
The default queue depth of iolatency_grp is unlimited, so we scale down
quickly (by half each time) in scale_cookie_change. Remove the "subtract
1/16th" part of the comment, which is inaccurate, and describe how we
actually scale down.
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Link: https://lore.kernel.org/r/20221018111240.22612-3-shikemeng@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Tue, 18 Oct 2022 11:12:38 +0000 (19:12 +0800)]
block: Remove redundant parent blkcg_gp check in check_scale_change
Function blkcg_iolatency_throttle makes sure blkg->parent is not
NULL before calling check_scale_change, and check_scale_change
is only called from blkcg_iolatency_throttle.
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20221018111240.22612-2-shikemeng@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sun, 30 Oct 2022 10:07:14 +0000 (11:07 +0100)]
block: split elevator_switch
Split an elevator_disable helper from elevator_switch for the case where
we want to switch to no scheduler at all. This includes removing the
pointless elevator_switch_mq helper and removing the switch-to-no-scheduler
logic from blk_mq_init_sched.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030100714.876891-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sun, 30 Oct 2022 10:07:13 +0000 (11:07 +0100)]
block: don't check for required features in elevator_match
Checking for the required features in the callers simplifies the code
quite a bit, so do that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030100714.876891-7-hch@lst.de
[axboe: adjust for dropping patch 1, use __elevator_find()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sun, 30 Oct 2022 10:07:12 +0000 (11:07 +0100)]
block: simplify the check for the current elevator in elv_iosched_show
Just compare the pointers instead of using the string based
elevator_match.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030100714.876891-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sun, 30 Oct 2022 10:07:11 +0000 (11:07 +0100)]
block: cleanup the variable naming in elv_iosched_store
Use eq for the elevator_queue as done elsewhere. This frees e to be used
for the loop iterator instead of the odd __ prefix. In addition rename
elv to cur to make it more clear it is the currently selected elevator.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030100714.876891-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sun, 30 Oct 2022 10:07:10 +0000 (11:07 +0100)]
block: exit elv_iosched_show early when I/O schedulers are not supported
If the tag_set has BLK_MQ_F_NO_SCHED flag set we will never show any
scheduler, so exit early.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030100714.876891-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Sun, 30 Oct 2022 10:07:09 +0000 (11:07 +0100)]
block: cleanup elevator_get
Do the request_module and repeated lookup in the only caller that cares,
pick a saner name that explains that we are actually doing a lookup, and
use a sane calling convention that passes the queue first.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030100714.876891-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Fri, 16 Sep 2022 07:19:42 +0000 (15:19 +0800)]
block, bfq: cleanup __bfq_weights_tree_remove()
It's now the same as bfq_weights_tree_remove().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220916071942.214222-7-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Fri, 16 Sep 2022 07:19:41 +0000 (15:19 +0800)]
block, bfq: cleanup bfq_weights_tree add/remove apis
The 'bfq_data' and 'rb_root_cached' can both be accessed through
'bfq_queue', thus only pass 'bfq_queue' as parameter.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220916071942.214222-6-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Fri, 16 Sep 2022 07:19:40 +0000 (15:19 +0800)]
block, bfq: do not idle if only one group is activated
Now that root group is counted into 'num_groups_with_pending_reqs',
'num_groups_with_pending_reqs > 0' is always true in
bfq_asymmetric_scenario(). Thus change the condition to '> 1'.
On the other hand, this change can enable concurrent sync io if only
one group is activated.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220916071942.214222-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Fri, 16 Sep 2022 07:19:39 +0000 (15:19 +0800)]
block, bfq: refactor the counting of 'num_groups_with_pending_reqs'
Currently, bfq can't handle sync io concurrently as long as it is
not issued from the root group. This is because
'bfqd->num_groups_with_pending_reqs > 0' is always true in
bfq_asymmetric_scenario().
How a bfqg is counted into 'num_groups_with_pending_reqs':
Before this patch:
1) The root group is never counted.
2) Count if the bfqg or its child bfqgs have pending requests.
3) Don't count if the bfqg and its child bfqgs have completed all requests.
After this patch:
1) The root group is counted.
2) Count if the bfqg has pending requests.
3) Don't count if the bfqg has completed all requests.
With this change, the case where only one group is activated can be
detected, and the next patch will support concurrent sync io in that
case.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220916071942.214222-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
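Note on the counting refactor above: the "after this patch" rules can be sketched in userspace C roughly as follows. This is only an illustration under stated assumptions; the struct and function names (bfqd_sketch, bfqg_sketch, queue_gets_pending_reqs, and so on) are hypothetical and are not the kernel's actual definitions.

```c
/* Hypothetical per-device state: how many groups have pending requests. */
struct bfqd_sketch {
	unsigned int num_groups_with_pending_reqs;
};

/* Hypothetical per-group state: how many of its own queues have pending requests. */
struct bfqg_sketch {
	struct bfqd_sketch *bfqd;
	unsigned int num_queues_with_pending_reqs;
};

/* A queue in group g starts having pending requests. */
static void queue_gets_pending_reqs(struct bfqg_sketch *g)
{
	/* Count the group only on its 0 -> 1 transition. */
	if (g->num_queues_with_pending_reqs++ == 0)
		g->bfqd->num_groups_with_pending_reqs++;
}

/* A queue in group g has completed all of its requests. */
static void queue_completes_all_reqs(struct bfqg_sketch *g)
{
	/* Stop counting the group only on its 1 -> 0 transition. */
	if (--g->num_queues_with_pending_reqs == 0)
		g->bfqd->num_groups_with_pending_reqs--;
}

/* The case the description says becomes detectable. */
static int only_one_group_active(const struct bfqd_sketch *d)
{
	return d->num_groups_with_pending_reqs == 1;
}
```

In this sketch the root group is treated like any other group, and child groups do not contribute to their parent's count, which mirrors rules 1) to 3) listed for the new behaviour.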
Yu Kuai [Fri, 16 Sep 2022 07:19:38 +0000 (15:19 +0800)]
block, bfq: record how many queues have pending requests
Prepare to refactor the counting of 'num_groups_with_pending_reqs'.
Add a counter in bfq_group, and update it while tracking whether a bfqq
has pending requests and when bfq_bfqq_move() is called.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220916071942.214222-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Fri, 16 Sep 2022 07:19:37 +0000 (15:19 +0800)]
block, bfq: support to track if bfqq has pending requests
If an entity belongs to a bfqq, then entity->in_groups_with_pending_reqs
is currently not used. This patch uses it to track whether the bfqq has
pending requests, through the callers of weights_tree insertion and removal.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220916071942.214222-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jinlong Chen [Sun, 30 Oct 2022 08:32:12 +0000 (16:32 +0800)]
blk-mq: remove redundant call to blk_freeze_queue_start in blk_mq_destroy_queue
The calling relationship in blk_mq_destroy_queue() is as follows:
blk_mq_destroy_queue()
    ...
    -> blk_queue_start_drain()
        -> blk_freeze_queue_start()  <- called
    ...
    -> blk_freeze_queue()
        -> blk_freeze_queue_start()  <- called again
        -> blk_mq_freeze_queue_wait()
    ...
So there is a redundant call to blk_freeze_queue_start().
Replace blk_freeze_queue() with blk_mq_freeze_queue_wait() to avoid the
redundant call.
Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030083212.1251255-1-nickyc975@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
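Note on the redundant freeze start above: the call chain from the description can be modelled with simple call counters. This is a rough userspace sketch only; frz_queue_sketch and the *_sketch functions are hypothetical stand-ins that count calls and do not reproduce the real freeze logic.

```c
/* Hypothetical queue that only records how often each step ran. */
struct frz_queue_sketch {
	unsigned int freeze_start_calls;
	unsigned int freeze_wait_calls;
};

static void freeze_start_sketch(struct frz_queue_sketch *q)
{
	q->freeze_start_calls++;
}

static void freeze_wait_sketch(struct frz_queue_sketch *q)
{
	q->freeze_wait_calls++;
}

/* Stand-in for blk_queue_start_drain(): starts the freeze. */
static void start_drain_sketch(struct frz_queue_sketch *q)
{
	freeze_start_sketch(q);
}

/* Stand-in for blk_freeze_queue(): starts the freeze, then waits. */
static void freeze_queue_sketch(struct frz_queue_sketch *q)
{
	freeze_start_sketch(q);
	freeze_wait_sketch(q);
}

/* Shape of the destroy path before the patch: the freeze is started twice. */
static void destroy_before_sketch(struct frz_queue_sketch *q)
{
	start_drain_sketch(q);
	freeze_queue_sketch(q);
}

/* Shape of the destroy path after the patch: start once, then only wait. */
static void destroy_after_sketch(struct frz_queue_sketch *q)
{
	start_drain_sketch(q);
	freeze_wait_sketch(q);
}
```

Both variants wait exactly once; only the number of freeze starts differs.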
Jinlong Chen [Sun, 30 Oct 2022 09:47:30 +0000 (17:47 +0800)]
blk-mq: move queue_is_mq out of blk_mq_cancel_work_sync
The only caller that needs queue_is_mq check is del_gendisk, so move the
check into it.
Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030094730.1275463-1-nickyc975@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dawei Li [Sun, 30 Oct 2022 05:20:08 +0000 (13:20 +0800)]
block: simplify blksize_bits() implementation
Convert the current loop-based implementation into a bit operation,
which brings two improvements:
1) bitops are more efficient thanks to arch-level optimization.
2) Given that blksize_bits() is inline, _if_ @size is compile-time
constant, it's possible that order_base_2() _may_ make output
compile-time evaluated, depending on code context and compiler behavior.
Signed-off-by: Dawei Li <set_pte_at@outlook.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/TYCP286MB23238842958D7C083D6B67CECA349@TYCP286MB2323.JPNP286.PROD.OUTLOOK.COM
Signed-off-by: Jens Axboe <axboe@kernel.dk>
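Note on the blksize_bits() change above: a userspace sketch comparing a loop-based variant with a bit-operation variant. The helper order_base_2_sketch() is a hypothetical stand-in for the kernel's order_base_2() and relies on the GCC/Clang builtin __builtin_clz(); SECTOR_SHIFT is assumed to be 9. The two variants are only claimed to agree for power-of-two sizes of at least 512.

```c
#include <limits.h>

#define SECTOR_SHIFT 9	/* assumed: 512-byte sectors */

/* Loop-based variant: shift until size drops to 256 or below. */
static unsigned int blksize_bits_loop(unsigned int size)
{
	unsigned int bits = 8;

	do {
		bits++;
		size >>= 1;
	} while (size > 256);
	return bits;
}

/*
 * Hypothetical stand-in for order_base_2(): ceil(log2(n)),
 * with n == 0 and n == 1 both mapping to 0.
 */
static unsigned int order_base_2_sketch(unsigned int n)
{
	if (n <= 1)
		return 0;
	return (unsigned int)(sizeof(n) * CHAR_BIT) -
	       (unsigned int)__builtin_clz(n - 1);
}

/* Bit-operation variant: log2 of the sector count plus the sector shift. */
static unsigned int blksize_bits_bitops(unsigned int size)
{
	return order_base_2_sketch(size >> SECTOR_SHIFT) + SECTOR_SHIFT;
}
```

For example, both variants map a size of 4096 to 12 bits.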
David Jeffery [Wed, 26 Oct 2022 05:19:57 +0000 (13:19 +0800)]
blk-mq: avoid double ->queue_rq() because of early timeout
David Jeffery found a double ->queue_rq() issue. So far it can be
triggered in VM use cases because of long vmexit latency, preemption
latency of the vCPU pthread, or a long page fault in the vCPU pthread.
A block IO request can then time out before it is queued to hardware but
after blk_mq_start_request() has been called during ->queue_rq(). The
timeout handler may handle it by requeueing, which causes a second
->queue_rq() call and a kernel panic.
So far, it has been the driver's responsibility to cover the race between
timeout and completion, so in theory this should be solved in the driver,
given that the driver has enough knowledge.
But this is really a common problem: many drivers could have a similar
issue, it could be hard to fix all affected drivers, and it isn't even
easy for a driver to handle the race. So David suggests this patch, which
solves the issue by draining in-progress ->queue_rq() calls.
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: virtualization@lists.linux-foundation.org
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221026051957.358818-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Tue, 25 Oct 2022 19:17:55 +0000 (12:17 -0700)]
block: Micro-optimize get_max_segment_size()
This patch removes a conditional jump from get_max_segment_size(). The
x86-64 assembler code for this function without this patch is as follows:
206 return min_not_zero(mask - offset + 1,
0x0000000000000118 <+72>: not %rax
0x000000000000011b <+75>: and 0x8(%r10),%rax
0x000000000000011f <+79>: add $0x1,%rax
0x0000000000000123 <+83>: je 0x138 <bvec_split_segs+104>
0x0000000000000125 <+85>: cmp %rdx,%rax
0x0000000000000128 <+88>: mov %rdx,%r12
0x000000000000012b <+91>: cmovbe %rax,%r12
0x000000000000012f <+95>: test %rdx,%rdx
0x0000000000000132 <+98>: mov %eax,%edx
0x0000000000000134 <+100>: cmovne %r12d,%edx
With this patch applied:
206 return min(mask - offset, (unsigned long)lim->max_segment_size - 1) + 1;
0x000000000000003f <+63>: mov 0x28(%rdi),%ebp
0x0000000000000042 <+66>: not %rax
0x0000000000000045 <+69>: and 0x8(%rdi),%rax
0x0000000000000049 <+73>: sub $0x1,%rbp
0x000000000000004d <+77>: cmp %rbp,%rax
0x0000000000000050 <+80>: cmova %rbp,%rax
0x0000000000000054 <+84>: add $0x1,%eax
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221025191755.1711437-4-bvanassche@acm.org
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
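Note on the get_max_segment_size() change above: the two return expressions quoted in the description can be compared in a userspace sketch. The helper min_not_zero_ul() is a simplified, hypothetical stand-in for the kernel's min_not_zero() macro, the function names are illustrative, offset is assumed to be already masked, and max_segment_size is assumed to be non-zero.

```c
/* Simplified stand-in for min_not_zero(): the smaller value, ignoring zeros. */
static unsigned long min_not_zero_ul(unsigned long x, unsigned long y)
{
	if (x == 0)
		return y;
	if (y == 0)
		return x;
	return x < y ? x : y;
}

/* Old form: "mask - offset + 1" can wrap to 0, hence the zero check. */
static unsigned int max_seg_old(unsigned long mask, unsigned long offset,
				unsigned int max_segment_size)
{
	return (unsigned int)min_not_zero_ul(mask - offset + 1,
					     max_segment_size);
}

/* New form: take the minimum first, then add 1, so nothing wraps. */
static unsigned int max_seg_new(unsigned long mask, unsigned long offset,
				unsigned int max_segment_size)
{
	unsigned long a = mask - offset;
	unsigned long b = (unsigned long)max_segment_size - 1;

	return (unsigned int)((a < b ? a : b) + 1);
}
```

With mask set to all ones and offset 0, the old form relies on the zero check to fall back to max_segment_size, while the new form reaches the same value with a plain minimum.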
Bart Van Assche [Tue, 25 Oct 2022 19:17:54 +0000 (12:17 -0700)]
block: Constify most queue limits pointers
Document which functions do not modify the queue limits.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221025191755.1711437-3-bvanassche@acm.org
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Tue, 25 Oct 2022 19:17:53 +0000 (12:17 -0700)]
block: Remove request.write_hint
Commit c75e707fe1aa ("block: remove the per-bio/request write hint")
removed all code that uses the struct request write_hint member. Hence
also remove 'write_hint' itself.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221025191755.1711437-2-bvanassche@acm.org
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 25 Oct 2022 15:59:16 +0000 (08:59 -0700)]
block: remove bio_start_io_acct_time
bio_start_io_acct_time is not actually used anywhere, so remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221025155916.270303-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 18 Oct 2022 13:57:20 +0000 (15:57 +0200)]
nvme-apple: remove an extra queue reference
Now that blk_mq_destroy_queue does not release the queue reference, there
is no need for a second admin queue reference to be held by the
apple_nvme structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20221018135720.670094-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 18 Oct 2022 13:57:19 +0000 (15:57 +0200)]
nvme-pci: remove an extra queue reference
Now that blk_mq_destroy_queue does not release the queue reference, there
is no need for a second admin queue reference to be held by the nvme_dev.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20221018135720.670094-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 18 Oct 2022 13:57:18 +0000 (15:57 +0200)]
scsi: remove an extra queue reference
Now that blk_mq_destroy_queue does not release the queue reference, there
is no need for a second queue reference to be held by the scsi_device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20221018135720.670094-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 18 Oct 2022 13:57:17 +0000 (15:57 +0200)]
blk-mq: move the call to blk_put_queue out of blk_mq_destroy_queue
The fact that blk_mq_destroy_queue also drops a queue reference leads
to various places having to grab an extra reference. Move the call to
blk_put_queue into the callers to allow removing the extra references.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20221018135720.670094-2-hch@lst.de
[axboe: fix fabrics_q vs admin_q conflict in nvme core.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jinlong Chen [Thu, 20 Oct 2022 06:48:19 +0000 (08:48 +0200)]
block: fix up elevator_type refcounting
The current reference management logic of io scheduler modules contains
refcnt problems. For example, blk_mq_init_sched may fail before or after
calling e->ops.init_sched. If it fails before the call, it does nothing
to the reference to the io scheduler module. But if it fails after the
call, it releases the reference by calling kobject_put(&eq->kobj).
As the callers of blk_mq_init_sched can't know exactly where the failure
happens, they can't handle the reference to the io scheduler module
properly: releasing the reference on failure results in double-release if
blk_mq_init_sched has released it, and not releasing the reference results
in ghost reference if blk_mq_init_sched did not release it either.
The same problem also exists in io schedulers' init_sched implementations.
We can address the problem by adding releasing statements to the error
handling procedures of blk_mq_init_sched and init_sched implementations.
But that is counterintuitive and requires modifications to existing io
schedulers.
Instead, we make elevator_alloc get the io scheduler module reference
that will be released by elevator_release. And then, we match each
elevator_get with an elevator_put. Therefore, each reference to an io
scheduler module explicitly has its own getter and releaser, and we no
longer need to worry about the refcnt problems.
The bugs and the patch can be validated with tools here:
https://github.com/nickyc975/linux_elv_refcnt_bug.git
[hch: split out a few bits into separate patches, use a non-try
module_get in elevator_alloc]
Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221020064819.1469928-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
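Note on the refcounting fix above: the "each get has its own put" rule can be sketched in userspace C. All names here (elv_type_sketch, elv_alloc_sketch, elv_switch_sketch, and so on) are hypothetical, and a plain integer stands in for the module reference count.

```c
#include <stdlib.h>

/* Hypothetical scheduler type with a plain integer as its reference count. */
struct elv_type_sketch {
	int refs;
};

/* Hypothetical per-queue scheduler instance. */
struct elv_queue_sketch {
	struct elv_type_sketch *type;
};

static void elv_get_sketch(struct elv_type_sketch *e) { e->refs++; }
static void elv_put_sketch(struct elv_type_sketch *e) { e->refs--; }

/* Allocation takes its own reference, owned by the returned instance. */
static struct elv_queue_sketch *elv_alloc_sketch(struct elv_type_sketch *e)
{
	struct elv_queue_sketch *eq = calloc(1, sizeof(*eq));

	if (!eq)
		return NULL;
	elv_get_sketch(e);
	eq->type = e;
	return eq;
}

/* Release drops exactly the reference taken at allocation. */
static void elv_release_sketch(struct elv_queue_sketch *eq)
{
	elv_put_sketch(eq->type);
	free(eq);
}

/* The lookup reference is dropped on every path, success or failure. */
static int elv_switch_sketch(struct elv_type_sketch *e, int fail_init,
			     struct elv_queue_sketch **out)
{
	struct elv_queue_sketch *eq;
	int ret = 0;

	elv_get_sketch(e);		/* reference held during the switch */
	eq = elv_alloc_sketch(e);
	if (!eq) {
		ret = -1;
	} else if (fail_init) {
		elv_release_sketch(eq);	/* drops the allocation reference */
		ret = -1;
	} else {
		*out = eq;
	}
	elv_put_sketch(e);		/* always paired with the get above */
	return ret;
}
```

Because the caller's reference and the instance's reference are separate, the caller no longer needs to know where a failure happened.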
Jinlong Chen [Thu, 20 Oct 2022 06:48:18 +0000 (08:48 +0200)]
block: check for an unchanged elevator earlier in __elevator_change
No need to find the actual elevator_type struct for this comparison,
the name is all that is needed.
Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221020064819.1469928-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 20 Oct 2022 06:48:17 +0000 (08:48 +0200)]
block: sanitize the elevator name before passing it to __elevator_change
The stripped name should also be used for the none check. To do so
strip it in the caller and pass in the sanitized name. Drop the pointless
__ prefix in the function name while we're at it.
Based on a patch from Jinlong Chen <nickyc975@zju.edu.cn>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221020064819.1469928-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 20 Oct 2022 06:48:16 +0000 (08:48 +0200)]
block: add proper helpers for elevator_type module refcount management
Make sure we have helpers for all relevant module refcount operations on
the elevator_type in elevator.h, and use them. Move the call to the get
helper in blk_mq_elv_switch_none a bit so that it is obvious with a less
verbose comment.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221020064819.1469928-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Wed, 19 Oct 2022 12:15:18 +0000 (20:15 +0800)]
blk-wbt: don't enable throttling if default elevator is bfq
Commit b5dc5d4d1f4f ("block,bfq: Disable writeback throttling") tries to
disable wbt for bfq by calling wbt_disable_default() in
bfq_init_queue(). However, wbt is still enabled if the default elevator
is bfq:
device_add_disk
 elevator_init_mq
  bfq_init_queue
   wbt_disable_default -> done nothing
 blk_register_queue
  wbt_enable_default -> wbt is enabled
Fix the problem by adding a new flag ELEVATOR_FLAG_DISABLE_WBT. bfq
will set the flag in bfq_init_queue, and the following
wbt_enable_default() won't enable wbt while the flag is set.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221019121518.3865235-7-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
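Note on the wbt/bfq ordering problem above: the call order from the description can be modelled in userspace C. Everything here is a hypothetical simplification; wbt_queue_sketch, ELV_FLAG_NO_WBT_SKETCH and the *_sketch functions are illustrative names, not kernel definitions.

```c
#include <stdbool.h>

#define ELV_FLAG_NO_WBT_SKETCH (1UL << 0)	/* hypothetical flag bit */

struct wbt_queue_sketch {
	unsigned long elv_flags;
	bool wbt_initialized;
	bool wbt_enabled;
};

/* Disabling does nothing if wbt has not been set up yet. */
static void wbt_disable_default_sketch(struct wbt_queue_sketch *q)
{
	if (!q->wbt_initialized)
		return;
	q->wbt_enabled = false;
}

/* Enabling is skipped while the scheduler has asked for wbt to stay off. */
static void wbt_enable_default_sketch(struct wbt_queue_sketch *q)
{
	if (q->elv_flags & ELV_FLAG_NO_WBT_SKETCH)
		return;
	q->wbt_initialized = true;
	q->wbt_enabled = true;
}

/* Scheduler init: optionally record the request as a flag, then disable. */
static void sched_init_sketch(struct wbt_queue_sketch *q, bool set_flag)
{
	if (set_flag)
		q->elv_flags |= ELV_FLAG_NO_WBT_SKETCH;
	wbt_disable_default_sketch(q);
}

/* Same order as the call chain: scheduler init first, then queue registration. */
static bool add_disk_sketch(bool set_flag)
{
	struct wbt_queue_sketch q = { 0, false, false };

	sched_init_sketch(&q, set_flag);
	wbt_enable_default_sketch(&q);
	return q.wbt_enabled;
}
```

Without the flag, the early disable is a no-op and the later enable wins; with the flag, the later enable is skipped.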
Yu Kuai [Wed, 19 Oct 2022 12:15:17 +0000 (20:15 +0800)]
elevator: add new field flags in struct elevator_queue
Currently there is only one flag, which indicates that the elevator is
registered. Prepare to add a flag to disable wbt if the default elevator
is bfq.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221019121518.3865235-6-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Wed, 19 Oct 2022 12:15:16 +0000 (20:15 +0800)]
blk-wbt: don't show valid wbt_lat_usec in sysfs while wbt is disabled
Currently, if wbt is initialized and then disabled by
wbt_disable_default(), sysfs will still show a valid wbt_lat_usec, which
may mislead users into thinking that wbt is still enabled.
This patch shows wbt_lat_usec as zero if wbt is disabled.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reported-and-tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221019121518.3865235-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Wed, 19 Oct 2022 12:15:15 +0000 (20:15 +0800)]
blk-wbt: make enable_state more accurate
Currently, if the user disables wbt through sysfs, 'enable_state' will be
'WBT_STATE_ON_MANUAL', which is confusing. Add a new state
'WBT_STATE_OFF_MANUAL' to cover that case.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221019121518.3865235-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Wed, 19 Oct 2022 12:15:14 +0000 (20:15 +0800)]
blk-wbt: remove unnecessary check in wbt_enable_default()
If CONFIG_BLK_WBT_MQ is disabled, wbt_init() won't do anything.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221019121518.3865235-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Wed, 19 Oct 2022 12:15:13 +0000 (20:15 +0800)]
elevator: remove redundant code in elv_unregister_queue()
"elevator_queue *e" is already declared and initialized in the beginning
of elv_unregister_queue().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221019121518.3865235-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Wed, 12 Oct 2022 09:40:35 +0000 (17:40 +0800)]
blk-iocost: read 'ioc->params' inside 'ioc->lock' in ioc_timer_fn()
'ioc->params' is updated in ioc_refresh_params(), which is protected by
'ioc->lock'. However, ioc_timer_fn() reads params outside the lock.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20221012094035.390056-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Wed, 12 Oct 2022 09:40:34 +0000 (17:40 +0800)]
blk-iocost: prevent configuration update concurrent with io throttling
This won't cause any severe problem currently. However, it doesn't
seem appropriate:
1) 'ioc->params' is read from multiple places without holding
'ioc->lock', so an unexpected value might be read if it is written
concurrently.
2) If the configuration is changed while io is being throttled, the
functionality might be affected. For example, if module params are
updated and the cost becomes smaller, waiting for a timer that was
calculated under the old configuration is not appropriate.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20221012094035.390056-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Wed, 12 Oct 2022 09:40:33 +0000 (17:40 +0800)]
blk-iocost: don't release 'ioc->lock' while updating params
ioc_qos_write() and ioc_cost_model_write() work the same way:
1) hold lock to read 'ioc->params' to local variable;
2) update params to local variable without lock;
3) hold lock to write local variable to 'ioc->params';
In theory, if the user updates params concurrently, an update might be
lost:
t1: update params a                           t2: update params b
spin_lock_irq(&ioc->lock);
memcpy(qos, ioc->params.qos, sizeof(qos))
spin_unlock_irq(&ioc->lock);
qos[a] = xxx;
                                              spin_lock_irq(&ioc->lock);
                                              memcpy(qos, ioc->params.qos, sizeof(qos))
                                              spin_unlock_irq(&ioc->lock);
                                              qos[b] = xxx;
spin_lock_irq(&ioc->lock);
memcpy(ioc->params.qos, qos, sizeof(qos));
ioc_refresh_params(ioc, true);
spin_unlock_irq(&ioc->lock);
                                              spin_lock_irq(&ioc->lock);
                                              // updates of a will be lost
                                              memcpy(ioc->params.qos, qos, sizeof(qos));
                                              ioc_refresh_params(ioc, true);
                                              spin_unlock_irq(&ioc->lock);
Although this is not a common case, the problem can be fixed easily by
holding the lock throughout the read, update, write process.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20221012094035.390056-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
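Note on the lost-update race above: a userspace sketch using a pthread mutex in place of the spinlock. The names (ioc_sketch, NR_QOS_SKETCH, snapshot_qos, commit_qos, update_qos_locked) are hypothetical. The snapshot/commit pair models the old three-step pattern so the interleaving from the timeline can be replayed by hand; update_qos_locked() models holding the lock across the whole read, update, write sequence.

```c
#include <pthread.h>
#include <string.h>

#define NR_QOS_SKETCH 4	/* hypothetical number of qos parameters */

struct ioc_sketch {
	pthread_mutex_t lock;
	unsigned int qos[NR_QOS_SKETCH];
};

/* Old pattern, step 1: copy the shared params to a local array under the lock. */
static void snapshot_qos(struct ioc_sketch *ioc, unsigned int *qos)
{
	pthread_mutex_lock(&ioc->lock);
	memcpy(qos, ioc->qos, sizeof(ioc->qos));
	pthread_mutex_unlock(&ioc->lock);
}

/* Old pattern, step 3: write the local array back under the lock. */
static void commit_qos(struct ioc_sketch *ioc, const unsigned int *qos)
{
	pthread_mutex_lock(&ioc->lock);
	memcpy(ioc->qos, qos, sizeof(ioc->qos));
	pthread_mutex_unlock(&ioc->lock);
}

/* Fixed pattern: read, update and write back in one critical section. */
static void update_qos_locked(struct ioc_sketch *ioc, int idx, unsigned int val)
{
	unsigned int qos[NR_QOS_SKETCH];

	pthread_mutex_lock(&ioc->lock);
	memcpy(qos, ioc->qos, sizeof(qos));
	qos[idx] = val;
	memcpy(ioc->qos, qos, sizeof(qos));
	pthread_mutex_unlock(&ioc->lock);
}
```

Replaying the timeline with two local arrays (snapshot both, modify each, commit both) leaves only the second writer's change in place, whereas two calls to update_qos_locked() keep both changes.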
Yu Kuai [Wed, 12 Oct 2022 09:40:32 +0000 (17:40 +0800)]
blk-iocost: disable writeback throttling
Commit b5dc5d4d1f4f ("block,bfq: Disable writeback throttling") disabled
wbt for bfq, because different write-throttling heuristics should not
work together.
For the same reason, wbt and iocost should not work together either,
unless the admin really wants to do that despite the performance
impact.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20221012094035.390056-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Sun, 23 Oct 2022 22:27:33 +0000 (15:27 -0700)]
Linux 6.1-rc2
Linus Torvalds [Sun, 23 Oct 2022 22:00:43 +0000 (15:00 -0700)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"RISC-V:
- Fix compilation without RISCV_ISA_ZICBOM
- Fix kvm_riscv_vcpu_timer_pending() for Sstc
ARM:
- Fix a bug preventing restoring an ITS containing mappings for very
large and very sparse device topology
- Work around a relocation handling error when compiling the nVHE
object with profile optimisation
- Fix for stage-2 invalidation holding the VM MMU lock for too long
by limiting the walk to the largest block mapping size
- Enable stack protection and branch profiling for VHE
- Two selftest fixes
x86:
- add compat implementation for KVM_X86_SET_MSR_FILTER ioctl
selftests:
- synchronize includes between include/uapi and tools/include/uapi"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
tools: include: sync include/api/linux/kvm.h
KVM: x86: Add compat handler for KVM_X86_SET_MSR_FILTER
KVM: x86: Copy filter arg outside kvm_vm_ioctl_set_msr_filter()
kvm: Add support for arch compat vm ioctls
RISC-V: KVM: Fix kvm_riscv_vcpu_timer_pending() for Sstc
RISC-V: Fix compilation without RISCV_ISA_ZICBOM
KVM: arm64: vgic: Fix exit condition in scan_its_table()
KVM: arm64: nvhe: Fix build with profile optimization
KVM: selftests: Fix number of pages for memory slot in memslot_modification_stress_test
KVM: arm64: selftests: Fix multiple versions of GIC creation
KVM: arm64: Enable stack protection and branch profiling for VHE
KVM: arm64: Limit stage2_apply_range() batch size to largest block
KVM: arm64: Work out supported block level at compile time
Jason A. Donenfeld [Sat, 8 Oct 2022 15:47:00 +0000 (09:47 -0600)]
Revert "mfd: syscon: Remove repetition of the regmap_get_val_endian()"
This reverts commit 72a95859728a7866522e6633818bebc1c2519b17.
It broke reboots on big-endian MIPS and MIPS64 malta QEMU instances,
which use the syscon driver. Little-endian is not affected, which means
likely it's important to handle regmap_get_val_endian() in this function
after all.
Fixes: 72a95859728a ("mfd: syscon: Remove repetition of the regmap_get_val_endian()")
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Lee Jones <lee@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 23 Oct 2022 19:01:01 +0000 (12:01 -0700)]
kernel/utsname_sysctl.c: Fix hostname polling
Commit bfca3dd3d068 ("kernel/utsname_sysctl.c: print kernel arch") added
a new entry to the uts_kern_table[] array, but didn't update the
UTS_PROC_xyz enumerators of older entries, breaking anything that used
them.
Which is admittedly not many cases: it's really just the two uses of
uts_proc_notify() in kernel/sys.c. But apparently journald-systemd
actually uses this to detect hostname changes.
Reported-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Fixes: bfca3dd3d068 ("kernel/utsname_sysctl.c: print kernel arch")
Link: https://lore.kernel.org/lkml/0c2b92a6-0f25-9538-178f-eee3b06da23f@secunet.com/
Link: https://linux-regtracking.leemhuis.info/regzbot/regression/0c2b92a6-0f25-9538-178f-eee3b06da23f@secunet.com/
Cc: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
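Note on the hostname polling fix above: the class of bug described (a new entry added to a table while the enumerators that index it stay unchanged) can be illustrated in userspace C. The table contents, their order, and the enumerator names below are assumptions made for illustration; they are not copied from kernel/utsname_sysctl.c.

```c
#include <string.h>

/* Hypothetical table after a new first entry ("arch") was added. */
static const char *const uts_table_sketch[] = {
	"arch", "ostype", "osrelease", "version", "hostname", "domainname",
};

/* Stale enumerators: still numbered as if "arch" were not in the table. */
enum stale_idx_sketch {
	STALE_OSTYPE,
	STALE_OSRELEASE,
	STALE_VERSION,
	STALE_HOSTNAME,
	STALE_DOMAINNAME,
};

/* Updated enumerators: a new first value keeps the rest aligned. */
enum fixed_idx_sketch {
	FIXED_ARCH,
	FIXED_OSTYPE,
	FIXED_OSRELEASE,
	FIXED_VERSION,
	FIXED_HOSTNAME,
	FIXED_DOMAINNAME,
};

/* Returns non-zero if the table entry at idx has the expected name. */
static int entry_is(int idx, const char *name)
{
	return strcmp(uts_table_sketch[idx], name) == 0;
}
```

In this sketch a notification sent for STALE_HOSTNAME lands on the "version" entry, so anything polling the hostname entry would never be woken.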
Linus Torvalds [Sun, 23 Oct 2022 17:14:45 +0000 (10:14 -0700)]
Merge tag 'perf_urgent_for_v6.1_rc2' of git://git./linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Fix raw data handling when perf events are used in bpf
- Rework how SIGTRAPs get delivered to events to address a bunch of
problems with it. Add a selftest for that too
* tag 'perf_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
bpf: Fix sample_flags for bpf_perf_event_output
selftests/perf_events: Add a SIGTRAP stress test with disables
perf: Fix missing SIGTRAPs
Linus Torvalds [Sun, 23 Oct 2022 17:10:55 +0000 (10:10 -0700)]
Merge tag 'sched_urgent_for_v6.1_rc2' of git://git./linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Adjust code to not trip up CFI
- Fix sched group cookie matching
* tag 'sched_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Introduce struct balance_callback to avoid CFI mismatches
sched/core: Fix comparison in sched_group_cookie_match()
Linus Torvalds [Sun, 23 Oct 2022 17:07:01 +0000 (10:07 -0700)]
Merge tag 'objtool_urgent_for_v6.1_rc2' of git://git./linux/kernel/git/tip/tip
Pull objtool fix from Borislav Petkov:
- Fix ORC stack unwinding when GCOV is enabled
* tag 'objtool_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/unwind/orc: Fix unreliable stack dump with gcov
Linus Torvalds [Sun, 23 Oct 2022 17:01:34 +0000 (10:01 -0700)]
Merge tag 'x86_urgent_for_v6.0_rc2' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
"As is usually the case, right after a major release, the tip urgent
branches accumulate a couple more fixes than normal. And here is the
x86, a bit bigger, urgent pile.
- Use the correct CPU capability clearing function on the error path
in Intel perf LBR
- A CFI fix to ftrace along with a simplification
- Adjust handling of zero capacity bit mask for resctrl cache
allocation on AMD
- A fix to the AMD microcode loader to attempt patch application on
every logical thread
- A couple of topology fixes to handle CPUID leaf 0x1f enumeration
info properly
- Drop a -mabi=ms compiler option check as both compilers support it
now anyway
- A couple of fixes to how the initial, statically allocated FPU
buffer state is setup and its interaction with dynamic states at
runtime"
* tag 'x86_urgent_for_v6.0_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Fix copy_xstate_to_uabi() to copy init states correctly
perf/x86/intel/lbr: Use setup_clear_cpu_cap() instead of clear_cpu_cap()
ftrace,kcfi: Separate ftrace_stub() and ftrace_stub_graph()
x86/ftrace: Remove ftrace_epilogue()
x86/resctrl: Fix min_cbm_bits for AMD
x86/microcode/AMD: Apply the patch early on every logical thread
x86/topology: Fix duplicated core ID within a package
x86/topology: Fix multiple packages shown on a single-package system
hwmon/coretemp: Handle large core ID value
x86/Kconfig: Drop check for -mabi=ms for CONFIG_EFI_STUB
x86/fpu: Exclude dynamic states from init_fpstate
x86/fpu: Fix the init_fpstate size check with the actual size
x86/fpu: Configure init_fpstate attributes orderly
Linus Torvalds [Sun, 23 Oct 2022 16:55:50 +0000 (09:55 -0700)]
Merge tag 'io_uring-6.1-2022-10-22' of git://git.kernel.dk/linux
Pull io_uring follow-up from Jens Axboe:
"Currently the zero-copy has automatic fallback to normal transmit, and
it was decided that it'd be cleaner to return an error instead if the
socket type doesn't support it.
Zero-copy does work with UDP and TCP, it's more of a future proofing
kind of thing (eg for samba)"
* tag 'io_uring-6.1-2022-10-22' of git://git.kernel.dk/linux:
io_uring/net: fail zc sendmsg when unsupported by socket
io_uring/net: fail zc send when unsupported by socket
net: flag sockets supporting msghdr originated zerocopy
Linus Torvalds [Sat, 22 Oct 2022 23:04:34 +0000 (16:04 -0700)]
Merge tag 'hwmon-for-v6.1-rc2' of git://git./linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:
- corsair-psu: Fix typo in USB id description, and add USB ID for new
PSU
- pwm-fan: Fix fan power handling when disabling fan control
* tag 'hwmon-for-v6.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (corsair-psu) Add USB id of the new HX1500i psu
hwmon: (pwm-fan) Explicitly switch off fan power when setting pwm1_enable to 0
hwmon: (corsair-psu) fix typo in USB id description
Linus Torvalds [Sat, 22 Oct 2022 22:59:46 +0000 (15:59 -0700)]
Merge tag 'i2c-for-6.1-rc2' of git://git./linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"RPM fix for qcom-cci, platform module alias for xiic, build warning
fix for mlxbf, typo fixes in comments"
* tag 'i2c-for-6.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: mlxbf: depend on ACPI; clean away ifdeffage
i2c: fix spelling typos in comments
i2c: qcom-cci: Fix ordering of pm_runtime_xx and i2c_add_adapter
i2c: xiic: Add platform module alias
Linus Torvalds [Sat, 22 Oct 2022 22:52:36 +0000 (15:52 -0700)]
Merge tag 'pci-v6.1-fixes-2' of git://git./linux/kernel/git/helgaas/pci
Pull pci fixes from Bjorn Helgaas:
- Revert a simplification that broke pci-tegra due to a masking error
- Update MAINTAINERS for Kishon's email address change and TI
DRA7XX/J721E maintainer change
* tag 'pci-v6.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
MAINTAINERS: Update Kishon's email address in PCI endpoint subsystem
MAINTAINERS: Add Vignesh Raghavendra as maintainer of TI DRA7XX/J721E PCI driver
Revert "PCI: tegra: Use PCI_CONF1_EXT_ADDRESS() macro"
Linus Torvalds [Sat, 22 Oct 2022 22:30:15 +0000 (15:30 -0700)]
Merge tag 'media/v6.1-2' of git://git./linux/kernel/git/mchehab/linux-media
Pull missed media updates from Mauro Carvalho Chehab:
"It seems I screwed-up my previous pull request: it ends up that only
half of the media patches that were in linux-next got merged in -rc1.
The script which creates the signed tags silently failed due to
5.19->6.0 so it ended generating a tag with incomplete stuff.
So here are the missing parts:
- a DVB core security fix
- lots of fixes and cleanups for atomisp staging driver
- old drivers that are VB1 are being moved to staging to be
deprecated
- several driver updates - mostly for embedded systems, but there are
also some things addressing issues with some PC webcams, in the UVC
video driver"
* tag 'media/v6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (163 commits)
media: sun6i-csi: Move csi buffer definition to main header file
media: sun6i-csi: Introduce and use video helper functions
media: sun6i-csi: Add media ops with link notify callback
media: sun6i-csi: Remove controls handler from the driver
media: sun6i-csi: Register the media device after creation
media: sun6i-csi: Pass and store csi device directly in video code
media: sun6i-csi: Tidy up video code
media: sun6i-csi: Tidy up v4l2 code
media: sun6i-csi: Tidy up Kconfig
media: sun6i-csi: Use runtime pm for clocks and reset
media: sun6i-csi: Define and use variant to get module clock rate
media: sun6i-csi: Always set exclusive module clock rate
media: sun6i-csi: Tidy up platform code
media: sun6i-csi: Refactor main driver data structures
media: sun6i-csi: Define and use driver name and (reworked) description
media: cedrus: Add a Kconfig dependency on RESET_CONTROLLER
media: sun8i-rotate: Add a Kconfig dependency on RESET_CONTROLLER
media: sun8i-di: Add a Kconfig dependency on RESET_CONTROLLER
media: sun4i-csi: Add a Kconfig dependency on RESET_CONTROLLER
media: sun6i-csi: Add a Kconfig dependency on RESET_CONTROLLER
...
Pavel Begunkov [Fri, 21 Oct 2022 10:16:41 +0000 (11:16 +0100)]
io_uring/net: fail zc sendmsg when unsupported by socket
The previous patch fails zerocopy send requests for protocols that don't
support it, do the same for zerocopy sendmsg.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0854e7bb4c3d810a48ec8b5853e2f61af36a0467.1666346426.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 21 Oct 2022 10:16:40 +0000 (11:16 +0100)]
io_uring/net: fail zc send when unsupported by socket
If a protocol doesn't support zerocopy it will silently fall back to
copying. This type of behaviour has always been a source of troubles
so it's better to fail such requests instead.
Cc: <stable@vger.kernel.org> # 6.0
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2db3c7f16bb6efab4b04569cd16e6242b40c5cb3.1666346426.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
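A minimal userspace sketch of the "fail instead of silently falling back" behaviour described above. The struct, the flag bit and the choice of -EOPNOTSUPP are assumptions made for illustration; they are not taken from the patch itself.

```c
#include <errno.h>

/* Hypothetical flag bit meaning "this socket supports zerocopy". */
#define SKETCH_SUPPORT_ZC 0

/* Hypothetical stand-in for a socket with a flags word. */
struct sock_sketch {
	unsigned long flags;
};

/*
 * Reject a zerocopy send up front when the socket does not advertise
 * support, instead of quietly degrading to a copying send.
 */
int zc_send_sketch(const struct sock_sketch *sk)
{
	if (!(sk->flags & (1UL << SKETCH_SUPPORT_ZC)))
		return -EOPNOTSUPP;	/* assumed error code */

	/* ... a real implementation would queue the zerocopy send here ... */
	return 0;
}
```

Returning an error makes the lack of support visible to the caller, which can then choose a regular send explicitly.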
Pavel Begunkov [Fri, 21 Oct 2022 10:16:39 +0000 (11:16 +0100)]
net: flag sockets supporting msghdr originated zerocopy
We need an efficient way in io_uring to check whether a socket supports
zerocopy with msghdr provided ubuf_info. Add a new flag into the struct
socket flags fields.
Cc: <stable@vger.kernel.org> # 6.0
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/3dafafab822b1c66308bb58a0ac738b1e3f53f74.1666346426.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Wilken Gottwalt [Sat, 8 Oct 2022 11:35:34 +0000 (11:35 +0000)]
hwmon: (corsair-psu) Add USB id of the new HX1500i psu
Also update the documentation accordingly.
Signed-off-by: Wilken Gottwalt <wilken.gottwalt@posteo.net>
Link: https://lore.kernel.org/r/Y0FghqQCHG/cX5Jz@monster.localdomain
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Paolo Bonzini [Sat, 22 Oct 2022 11:43:52 +0000 (07:43 -0400)]
tools: include: sync include/api/linux/kvm.h
Provide a definition of KVM_CAP_DIRTY_LOG_RING_ACQ_REL.
Fixes: 17601bfed909 ("KVM: Add KVM_CAP_DIRTY_LOG_RING_ACQ_REL capability and config option")
Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Alexander Graf [Mon, 17 Oct 2022 18:45:41 +0000 (20:45 +0200)]
KVM: x86: Add compat handler for KVM_X86_SET_MSR_FILTER
The KVM_X86_SET_MSR_FILTER ioctl contains a pointer in the passed in
struct which means it has a different struct size depending on whether
it gets called from 32bit or 64bit code.
This patch introduces compat code that converts from the 32bit struct to
its 64bit counterpart which then gets used going forward internally.
With this applied, 32bit QEMU can successfully set MSR bitmaps when
running on 64bit kernels.
Reported-by: Andrew Randrianasulu <randrianasulu@gmail.com>
Fixes: 1a155254ff937 ("KVM: x86: Introduce MSR filtering")
Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <20221017184541.2658-4-graf@amazon.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
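Why an embedded pointer changes the struct size between 32-bit and 64-bit callers, and what a compat conversion does about it, can be sketched as follows. The two layouts and range_from_compat() are hypothetical simplifications, not the real KVM definitions.

```c
#include <stdint.h>

/* Layout as a 64-bit caller would see it (hypothetical). */
struct range_native {
	uint32_t flags;
	uint32_t nmsrs;
	uint32_t base;
	uint64_t bitmap;	/* user pointer, 8 bytes wide */
};

/* Layout as a 32-bit caller would see it (hypothetical). */
struct range_compat {
	uint32_t flags;
	uint32_t nmsrs;
	uint32_t base;
	uint32_t bitmap;	/* user pointer, 4 bytes wide */
};

/*
 * Convert the 32-bit layout into the 64-bit one so that the rest of
 * the code only ever deals with a single representation.
 */
void range_from_compat(struct range_native *dst,
		       const struct range_compat *src)
{
	dst->flags = src->flags;
	dst->nmsrs = src->nmsrs;
	dst->base = src->base;
	dst->bitmap = (uint64_t)src->bitmap;	/* zero-extend the pointer */
}
```

Because the two layouts differ in size, data copied from a 32-bit caller cannot simply be reinterpreted as the 64-bit struct; a field-by-field conversion like the one above is needed first.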
Alexander Graf [Mon, 17 Oct 2022 18:45:40 +0000 (20:45 +0200)]
KVM: x86: Copy filter arg outside kvm_vm_ioctl_set_msr_filter()
In the next patch we want to introduce a second caller to
set_msr_filter() which constructs its own filter list on the stack.
Refactor the original function so it takes it as argument instead of
reading it through copy_from_user().
Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <20221017184541.2658-3-graf@amazon.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Alexander Graf [Mon, 17 Oct 2022 18:45:39 +0000 (20:45 +0200)]
kvm: Add support for arch compat vm ioctls
We will introduce the first architecture specific compat vm ioctl in the
next patch. Add all necessary boilerplate to allow architectures to
override compat vm ioctls when necessary.
Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <20221017184541.2658-2-graf@amazon.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Sat, 22 Oct 2022 07:33:58 +0000 (03:33 -0400)]
Merge tag 'kvm-riscv-fixes-6.1-1' of https://github.com/kvm-riscv/linux into HEAD
KVM/riscv fixes for 6.1, take #1
- Fix compilation without RISCV_ISA_ZICBOM
- Fix kvm_riscv_vcpu_timer_pending() for Sstc
Paolo Bonzini [Sat, 22 Oct 2022 07:33:26 +0000 (03:33 -0400)]
Merge tag 'kvmarm-fixes-6.1-2' of git://git./linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.1, take #2
- Fix a bug preventing restoring an ITS containing mappings
for very large and very sparse device topology
- Work around a relocation handling error when compiling
the nVHE object with profile optimisation
Paolo Bonzini [Sat, 22 Oct 2022 07:32:23 +0000 (03:32 -0400)]
Merge tag 'kvmarm-fixes-6.1-1' of git://git./linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.1, take #1
- Fix for stage-2 invalidation holding the VM MMU lock
for too long by limiting the walk to the largest
block mapping size
- Enable stack protection and branch profiling for VHE
- Two selftest fixes
Linus Torvalds [Sat, 22 Oct 2022 01:26:00 +0000 (18:26 -0700)]
Merge tag 'thermal-6.1-rc2' of git://git./linux/kernel/git/rafael/linux-pm
Pull thermal control fix from Rafael Wysocki:
"This fixes the control CPU selection in the intel_powerclamp thermal
driver"
* tag 'thermal-6.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thermal: intel_powerclamp: Use first online CPU as control_cpu
Linus Torvalds [Sat, 22 Oct 2022 01:19:42 +0000 (18:19 -0700)]
Merge tag 'pm-6.1-rc2' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix some issues and clean up code in ARM cpufreq drivers.
Specifics:
- Fix module loading in the Tegra194 cpufreq driver (Jon Hunter)
- Fix memory leak and update to read-only region in the qcom cpufreq
driver (Fabien Parent)
- Miscellaneous minor cleanups to cpufreq drivers (Fabien Parent,
Yang Yingliang)"
* tag 'pm-6.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: sun50i: Switch to use dev_err_probe() helper
cpufreq: qcom-nvmem: Switch to use dev_err_probe() helper
cpufreq: imx6q: Switch to use dev_err_probe() helper
cpufreq: dt: Switch to use dev_err_probe() helper
cpufreq: qcom: remove unused parameter in function definition
cpufreq: qcom: fix writes in read-only memory region
cpufreq: qcom: fix memory leak in error path
cpufreq: tegra194: Fix module loading
Linus Torvalds [Sat, 22 Oct 2022 01:08:30 +0000 (18:08 -0700)]
Merge tag 'acpi-6.1-rc2' of git://git./linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix issues introduced during this merge window (ACPI/PCI, device
enumeration and documentation) and some other ones found recently.
Specifics:
- Add missing device reference counting to acpi_get_pci_dev() after
changing it recently (Rafael Wysocki)
- Fix resource list walk in acpi_dma_get_range() (Robin Murphy)
- Add IRQ override quirk for LENOVO IdeaPad and extend the IRQ
override warning message (Jiri Slaby)
- Fix integer overflow in ghes_estatus_pool_init() (Ashish Kalra)
- Fix multiple error records handling in one of the ACPI extlog
driver code paths (Tony Luck)
- Prune DSDT override documentation from index after dropping it
(Bagas Sanjaya)"
* tag 'acpi-6.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: scan: Fix DMA range assignment
ACPI: PCI: Fix device reference counting in acpi_get_pci_dev()
ACPI: resource: note more about IRQ override
ACPI: resource: do IRQ override on LENOVO IdeaPad
ACPI: extlog: Handle multiple records
ACPI: APEI: Fix integer overflow in ghes_estatus_pool_init()
Documentation: ACPI: Prune DSDT override documentation from index
Linus Torvalds [Sat, 22 Oct 2022 01:02:36 +0000 (18:02 -0700)]
Merge tag 'efi-fixes-for-v6.1-1' of git://git./linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
- fixes for the EFI variable store refactor that landed in v6.0
- fixes for issues that were introduced during the merge window
- back out some changes related to EFI zboot signing - we'll add a
better solution for this during the next cycle
* tag 'efi-fixes-for-v6.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi: runtime: Don't assume virtual mappings are missing if VA == PA == 0
efi: libstub: Fix incorrect payload size in zboot header
efi: libstub: Give efi_main() asmlinkage qualification
efi: efivars: Fix variable writes without query_variable_store()
efi: ssdt: Don't free memory if ACPI table was loaded successfully
efi: libstub: Remove zboot signing from build options
Linus Torvalds [Sat, 22 Oct 2022 00:47:39 +0000 (17:47 -0700)]
Merge tag 'iommu-fixes-v6.1-rc1' of git://git./linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
"Intel VT-d fixes:
- Fix a lockdep splat issue in intel_iommu_init()
- Allow NVS regions to pass RMRR check
- Domain cleanup in error path"
* tag 'iommu-fixes-v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/vt-d: Clean up si_domain in the init_dmars() error path
iommu/vt-d: Allow NVS regions in arch_rmrr_sanity_check()
iommu/vt-d: Use rcu_lock in get_resv_regions
iommu: Add gfp parameter to iommu_alloc_resv_region
Linus Torvalds [Sat, 22 Oct 2022 00:41:57 +0000 (17:41 -0700)]
Merge tag 'for-linus-2022102101' of git://git./linux/kernel/git/hid/hid
Pull HID fixes from Benjamin Tissoires:
- a 12 year old bug fix for the Apple Magic Trackpad v1 (José Expósito)
- a fix for a potential crash on removal of the Playstation controllers
(Roderick Colenbrander)
- a few new device IDs and device-specific quirks, most notably support
of the new Playstation DualSense Edge controller
* tag 'for-linus-2022102101' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
HID: lenovo: Make array tp10ubkbd_led static const
HID: saitek: add madcatz variant of MMO7 mouse device ID
HID: playstation: support updated DualSense rumble mode.
HID: playstation: add initial DualSense Edge controller support
HID: playstation: stop DualSense output work on remove.
HID: magicmouse: Do not set BTN_MOUSE on double report
Linus Torvalds [Fri, 21 Oct 2022 23:01:53 +0000 (16:01 -0700)]
Merge tag '6.1-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
- memory leak fixes
- fixes for directory leases, including an important one which fixes a
problem noticed by git functional tests
- fixes relating to missing free_xid calls (helpful for
tracing/debugging of entry/exit into cifs.ko)
- a multichannel fix
- a small cleanup fix (use of list_move instead of list_del/list_add)
* tag '6.1-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module number
cifs: fix memory leaks in session setup
cifs: drop the lease for cached directories on rmdir or rename
smb3: interface count displayed incorrectly
cifs: Fix memory leak when build ntlmssp negotiate blob failed
cifs: set rc to -ENOENT if we can not get a dentry for the cached dir
cifs: use LIST_HEAD() and list_move() to simplify code
cifs: Fix xid leak in cifs_get_file_info_unix()
cifs: Fix xid leak in cifs_ses_add_channel()
cifs: Fix xid leak in cifs_flock()
cifs: Fix xid leak in cifs_copy_file_range()
cifs: Fix xid leak in cifs_create()
Linus Torvalds [Fri, 21 Oct 2022 22:51:30 +0000 (15:51 -0700)]
Merge tag 'nfsd-6.1-2' of git://git./linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
"Fixes for patches merged in v6.1"
* tag 'nfsd-6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: ensure we always call fh_verify_error tracepoint
NFSD: unregister shrinker when nfsd_init_net() fails
Chang S. Bae [Fri, 21 Oct 2022 18:58:44 +0000 (11:58 -0700)]
x86/fpu: Fix copy_xstate_to_uabi() to copy init states correctly
When an extended state component is not present in fpstate, but in init
state, the function copies from init_fpstate via copy_feature().
But, dynamic states are not present in init_fpstate because of all-zeros
init states. Then retrieving them from init_fpstate will explode like this:
BUG: kernel NULL pointer dereference, address: 0000000000000000
...
RIP: 0010:memcpy_erms+0x6/0x10
? __copy_xstate_to_uabi_buf+0x381/0x870
fpu_copy_guest_fpstate_to_uabi+0x28/0x80
kvm_arch_vcpu_ioctl+0x14c/0x1460 [kvm]
? __this_cpu_preempt_check+0x13/0x20
? vmx_vcpu_put+0x2e/0x260 [kvm_intel]
kvm_vcpu_ioctl+0xea/0x6b0 [kvm]
? kvm_vcpu_ioctl+0xea/0x6b0 [kvm]
? __fget_light+0xd4/0x130
__x64_sys_ioctl+0xe3/0x910
? debug_smp_processor_id+0x17/0x20
? fpregs_assert_state_consistent+0x27/0x50
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Adjust the 'mask' to zero out the userspace buffer for the features that
are not available both from fpstate and from init_fpstate.
The dynamic features depend on the compacted XSAVE format. Ensure it is
enabled before reading XCOMP_BV in init_fpstate.
Fixes: 2308ee57d93d ("x86/fpu/amx: Enable the AMX feature in 64-bit mode")
Reported-by: Yuan Yao <yuan.yao@intel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/lkml/BYAPR11MB3717EDEF2351C958F2C86EED95259@BYAPR11MB3717.namprd11.prod.outlook.com/
Link: https://lkml.kernel.org/r/20221021185844.13472-1-chang.seok.bae@intel.com
Linus Torvalds [Fri, 21 Oct 2022 22:19:43 +0000 (15:19 -0700)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Two small changes, one in the lpfc driver and the other in the core.
The core change is an additional footgun guard which prevents users
from writing the wrong state to sysfs and causing a hang"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: lpfc: Fix memory leak in lpfc_create_port()
scsi: core: Restrict legal sdev_state transitions via sysfs
Linus Torvalds [Fri, 21 Oct 2022 22:14:14 +0000 (15:14 -0700)]
Merge tag 'block-6.1-2022-10-20' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Christoph:
- fix nvme-hwmon for DMA non-coherent architectures (Serge Semin)
- add a nvme-hwmon maintainer (Christoph Hellwig)
- fix error pointer dereference in error handling (Dan Carpenter)
- fix invalid memory reference in nvmet_subsys_attr_qid_max_show
(Daniel Wagner)
- don't limit the DMA segment size in nvme-apple (Russell King)
- fix workqueue MEM_RECLAIM flushing dependency (Sagi Grimberg)
- disable write zeroes on various Kingston SSDs (Xander Li)
- fix a memory leak with block device tracing (Ye)
- flexible-array fix for ublk (Yushan)
- document the ublk recovery feature from this merge window
(ZiyangZhang)
- remove dead bfq variable in struct (Yuwei)
- error handling rq clearing fix (Yu)
- add an IRQ safety check for the cached bio freeing (Pavel)
- drbd bio cloning fix (Christoph)
* tag 'block-6.1-2022-10-20' of git://git.kernel.dk/linux:
blktrace: remove unnessary stop block trace in 'blk_trace_shutdown'
blktrace: fix possible memleak in '__blk_trace_remove'
blktrace: introduce 'blk_trace_{start,stop}' helper
bio: safeguard REQ_ALLOC_CACHE bio put
block, bfq: remove unused variable for bfq_queue
drbd: only clone bio if we have a backing device
ublk_drv: use flexible-array member instead of zero-length array
nvmet: fix invalid memory reference in nvmet_subsys_attr_qid_max_show
nvmet: fix workqueue MEM_RECLAIM flushing dependency
nvme-hwmon: kmalloc the NVME SMART log buffer
nvme-hwmon: consistently ignore errors from nvme_hwmon_init
nvme: add Guenther as nvme-hwmon maintainer
nvme-apple: don't limit DMA segement size
nvme-pci: disable write zeroes on various Kingston SSD
nvme: fix error pointer dereference in error handling
Documentation: document ublk user recovery feature
blk-mq: fix null pointer dereference in blk_mq_clear_rq_mapping()
Linus Torvalds [Fri, 21 Oct 2022 22:09:10 +0000 (15:09 -0700)]
Merge tag 'io_uring-6.1-2022-10-20' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Fix a potential memory leak in the error handling path of io-wq setup
(Rafael)
- Kill an errant debug statement that got added in this release (me)
- Fix an oops with an invalid direct descriptor with IORING_OP_MSG_RING
(Harshit)
- Remove unneeded FFS_SCM flagging (Pavel)
- Remove polling off the exit path (Pavel)
- Move out direct descriptor debug check to the cleanup path (Pavel)
- Use the proper helper rather than open-coding cached request get
(Pavel)
* tag 'io_uring-6.1-2022-10-20' of git://git.kernel.dk/linux:
io-wq: Fix memory leak in worker creation
io_uring/msg_ring: Fix NULL pointer dereference in io_msg_send_fd()
io_uring/rw: remove leftover debug statement
io_uring: don't iopoll from io_ring_ctx_wait_and_kill()
io_uring: reuse io_alloc_req()
io_uring: kill hot path fixed file bitmap debug checks
io_uring: remove FFS_SCM
Linus Torvalds [Fri, 21 Oct 2022 21:43:09 +0000 (14:43 -0700)]
Merge tag 'for-linus-6.1-rc2-tag' of git://git./linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"Just two fixes for the new 'virtio with grants' feature"
* tag 'for-linus-6.1-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/virtio: Convert PAGE_SIZE/PAGE_SHIFT/PFN_UP to Xen counterparts
xen/virtio: Handle cases when page offset > PAGE_SIZE properly
Linus Torvalds [Fri, 21 Oct 2022 21:33:36 +0000 (14:33 -0700)]
Merge tag 'selinux-pr-20221020' of git://git./linux/kernel/git/pcmoore/selinux
Pull selinux fix from Paul Moore:
"A small SELinux fix for a GFP_KERNEL allocation while a spinlock is
held.
The patch, while still fairly small, is a bit larger than one might
expect from a simple s/GFP_KERNEL/GFP_ATOMIC/ conversion because we
added support for the function to be called with different gfp flags
depending on the context, preserving GFP_KERNEL for those cases that
can safely sleep"
* tag 'selinux-pr-20221020' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: enable use of both GFP_KERNEL and GFP_ATOMIC in convert_context()
Linus Torvalds [Fri, 21 Oct 2022 19:33:03 +0000 (12:33 -0700)]
Merge tag 'mm-hotfixes-stable-2022-10-20' of git://git./linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Seventeen hotfixes, mainly for MM.
Five are cc:stable and the remainder address post-6.0 issues"
* tag 'mm-hotfixes-stable-2022-10-20' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
nouveau: fix migrate_to_ram() for faulting page
mm/huge_memory: do not clobber swp_entry_t during THP split
hugetlb: fix memory leak associated with vma_lock structure
mm/page_alloc: reduce potential fragmentation in make_alloc_exact()
mm: /proc/pid/smaps_rollup: fix maple tree search
mm,hugetlb: take hugetlb_lock before decrementing h->resv_huge_pages
mm/mmap: fix MAP_FIXED address return on VMA merge
mm/mmap.c: __vma_adjust(): suppress uninitialized var warning
mm/mmap: undo ->mmap() when mas_preallocate() fails
init: Kconfig: fix spelling mistake "satify" -> "satisfy"
ocfs2: clear dinode links count in case of error
ocfs2: fix BUG when iput after ocfs2_mknod fails
gcov: support GCC 12.1 and newer compilers
zsmalloc: zs_destroy_pool: add size_class NULL check
mm/mempolicy: fix mbind_range() arguments to vma_merge()
mailmap: update email for Qais Yousef
mailmap: update Dan Carpenter's email address
Linus Torvalds [Fri, 21 Oct 2022 19:29:52 +0000 (12:29 -0700)]
Merge tag 'trace-tools-6.1-rc1' of git://git./linux/kernel/git/trace/linux-trace
Pull tracing tool update from Steven Rostedt:
- Make dot2c generate monitor's automata definition static
* tag 'trace-tools-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rv/dot2c: Make automaton definition static
Linus Torvalds [Fri, 21 Oct 2022 19:25:39 +0000 (12:25 -0700)]
Merge tag 'linux-watchdog-6.1-rc2' of git://linux-watchdog.org/linux-watchdog
Pull watchdog updates from Wim Van Sebroeck:
- Add tracing events for the most common watchdog events
* tag 'linux-watchdog-6.1-rc2' of git://www.linux-watchdog.org/linux-watchdog:
watchdog: Add tracing events for the most usual watchdog events
Rafael J. Wysocki [Fri, 21 Oct 2022 18:07:41 +0000 (20:07 +0200)]
Merge branches 'acpi-scan', 'acpi-resource', 'acpi-apei', 'acpi-extlog' and 'acpi-docs'
Merge assorted ACPI fixes for 6.1-rc2:
- Fix resource list walk in acpi_dma_get_range() (Robin Murphy).
- Add IRQ override quirk for LENOVO IdeaPad and extend the IRQ
override warning message (Jiri Slaby).
- Fix integer overflow in ghes_estatus_pool_init() (Ashish Kalra).
- Fix multiple error records handling in one of the ACPI extlog driver
code paths (Tony Luck).
- Prune DSDT override documentation from index after dropping it (Bagas
Sanjaya).
* acpi-scan:
ACPI: scan: Fix DMA range assignment
* acpi-resource:
ACPI: resource: note more about IRQ override
ACPI: resource: do IRQ override on LENOVO IdeaPad
* acpi-apei:
ACPI: APEI: Fix integer overflow in ghes_estatus_pool_init()
* acpi-extlog:
ACPI: extlog: Handle multiple records
* acpi-docs:
Documentation: ACPI: Prune DSDT override documentation from index
Chen Zhongjin [Wed, 27 Jul 2022 03:15:06 +0000 (11:15 +0800)]
x86/unwind/orc: Fix unreliable stack dump with gcov
When a console stack dump is initiated with CONFIG_GCOV_PROFILE_ALL
enabled, show_trace_log_lvl() gets out of sync with the ORC unwinder,
causing the stack trace to show all text addresses as unreliable:
# echo l > /proc/sysrq-trigger
[ 477.521031] sysrq: Show backtrace of all active CPUs
[ 477.523813] NMI backtrace for cpu 0
[ 477.524492] CPU: 0 PID: 1021 Comm: bash Not tainted 6.0.0 #65
[ 477.525295] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-1.fc36 04/01/2014
[ 477.526439] Call Trace:
[ 477.526854] <TASK>
[ 477.527216] ? dump_stack_lvl+0xc7/0x114
[ 477.527801] ? dump_stack+0x13/0x1f
[ 477.528331] ? nmi_cpu_backtrace.cold+0xb5/0x10d
[ 477.528998] ? lapic_can_unplug_cpu+0xa0/0xa0
[ 477.529641] ? nmi_trigger_cpumask_backtrace+0x16a/0x1f0
[ 477.530393] ? arch_trigger_cpumask_backtrace+0x1d/0x30
[ 477.531136] ? sysrq_handle_showallcpus+0x1b/0x30
[ 477.531818] ? __handle_sysrq.cold+0x4e/0x1ae
[ 477.532451] ? write_sysrq_trigger+0x63/0x80
[ 477.533080] ? proc_reg_write+0x92/0x110
[ 477.533663] ? vfs_write+0x174/0x530
[ 477.534265] ? handle_mm_fault+0x16f/0x500
[ 477.534940] ? ksys_write+0x7b/0x170
[ 477.535543] ? __x64_sys_write+0x1d/0x30
[ 477.536191] ? do_syscall_64+0x6b/0x100
[ 477.536809] ? entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 477.537609] </TASK>
This happens when the compiled code for show_stack() has a single word
on the stack, and doesn't use a tail call to show_stack_log_lvl().
(CONFIG_GCOV_PROFILE_ALL=y is the only known case of this.) Then the
__unwind_start() skip logic hits an off-by-one bug and fails to unwind
all the way to the intended starting frame.
Fix it by reverting the following commit:
f1d9a2abff66 ("x86/unwind/orc: Don't skip the first frame for inactive tasks")
The original justification for that commit no longer exists. That
original issue was later fixed in a different way, with the following
commit:
f2ac57a4c49d ("x86/unwind/orc: Fix inactive tasks with stack pointer in %sp on GCC 10 compiled kernels")
Fixes: f1d9a2abff66 ("x86/unwind/orc: Don't skip the first frame for inactive tasks")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
[jpoimboe: rewrite commit log]
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Ard Biesheuvel [Thu, 20 Oct 2022 13:16:09 +0000 (15:16 +0200)]
efi: runtime: Don't assume virtual mappings are missing if VA == PA == 0
The generic EFI stub can be instructed to avoid SetVirtualAddressMap(),
and simply run with the firmware's 1:1 mapping. In this case, it
populates the virtual address fields of the runtime regions in the
memory map with the physical address of each region, so that the mapping
code has to be none the wiser. Only if SetVirtualAddressMap() fails, the
virtual addresses are wiped and the kernel code knows that the regions
cannot be mapped.
However, wiping amounts to setting it to zero, and if a runtime region
happens to live at physical address 0, its valid 1:1 mapped virtual
address could be mistaken for a wiped field, resulting in loss of access
to the EFI services at runtime.
So let's only assume that VA == 0 means 'no runtime services' if the
region in question does not live at PA 0x0.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
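The check described in the VA == PA == 0 fix above can be sketched as two predicates over a runtime region. The struct and function names are hypothetical; only the condition itself is taken from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one runtime region of the memory map. */
struct region_sketch {
	uint64_t phys_addr;
	uint64_t virt_addr;
};

/* Old assumption: a zero virtual address always means "wiped". */
bool region_unmapped_old(const struct region_sketch *r)
{
	return r->virt_addr == 0;
}

/*
 * New assumption: a zero virtual address only means "wiped" when the
 * region does not itself live at physical address 0, because a 1:1
 * mapping of such a region legitimately has VA == PA == 0.
 */
bool region_unmapped_new(const struct region_sketch *r)
{
	return r->virt_addr == 0 && r->phys_addr != 0;
}
```

For a 1:1 mapped region at physical address 0, the old predicate reports the region as unmapped while the new one does not; for every other region the two agree.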
Ard Biesheuvel [Thu, 20 Oct 2022 09:26:42 +0000 (11:26 +0200)]
efi: libstub: Fix incorrect payload size in zboot header
The linker script symbol definition that captures the size of the
compressed payload inside the zboot decompressor (which is exposed via
the image header) refers to '.' for the end of the region, which does
not give the correct result as the expression is not placed at the end
of the payload. So use the symbol name explicitly.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Ard Biesheuvel [Fri, 14 Oct 2022 17:29:57 +0000 (19:29 +0200)]
efi: libstub: Give efi_main() asmlinkage qualification
To stop the bots from sending sparse warnings to me and the list about
efi_main() not having a prototype, decorate it with asmlinkage so that
it is clear that it is called from assembly, and therefore needs to
remain external, even if it is never declared in a header file.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Ard Biesheuvel [Wed, 19 Oct 2022 21:29:58 +0000 (23:29 +0200)]
efi: efivars: Fix variable writes without query_variable_store()
Commit
bbc6d2c6ef22 ("efi: vars: Switch to new wrapper layer")
refactored the efivars layer so that the 'business logic' related to
which UEFI variables affect the boot flow in which way could be moved
out of it, and into the efivarfs driver.
This inadvertently broke setting variables on firmware implementations
that lack the QueryVariableInfo() boot service, because we no longer
tolerate an EFI_UNSUPPORTED result from check_var_size() when calling
efivar_entry_set_get_size(), which now ends up calling check_var_size()
a second time inadvertently.
If QueryVariableInfo() is missing, we support writes of up to 64k -
let's move that logic into check_var_size(), and drop the redundant
call.
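A minimal sketch of folding the fallback into a single size check
follows. All names here (var_size_ok(), query_store_fn,
MAX_WRITE_WITHOUT_QUERY) are hypothetical, and treating the 64k limit
as inclusive is an assumption; the real check_var_size() works on EFI
status codes rather than booleans.

```c
#include <stdbool.h>
#include <stddef.h>

/* "up to 64k" from the text; inclusive bound is an assumption. */
#define MAX_WRITE_WITHOUT_QUERY (64 * 1024)

/* Hypothetical callback: returns true if the store can hold 'size' bytes. */
typedef bool (*query_store_fn)(size_t size);

/* One size check that also handles a missing query service. */
static bool var_size_ok(query_store_fn query, size_t size)
{
	if (!query)
		return size <= MAX_WRITE_WITHOUT_QUERY;
	return query(size);
}
```

Keeping the fallback inside the check means callers need neither a
special case for the missing service nor a second call.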
Cc: <stable@vger.kernel.org> # v6.0
Fixes:
bbc6d2c6ef22 ("efi: vars: Switch to new wrapper layer")
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Ard Biesheuvel [Fri, 14 Oct 2022 10:25:52 +0000 (12:25 +0200)]
efi: ssdt: Don't free memory if ACPI table was loaded successfully
Amadeusz reports KASAN use-after-free errors introduced by commit
3881ee0b1edc ("efi: avoid efivars layer when loading SSDTs from
variables"). The problem appears to be that the memory that holds the
new ACPI table is now freed unconditionally, instead of only when the
ACPI core reported a failure to load the table.
So let's fix this, by omitting the kfree() on success.
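The ownership rule can be sketched in userspace C, with malloc()/free()
standing in for the kernel allocator. load_table(), load_ok() and
load_fail() are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical consumer: on success it keeps 'table' and returns true. */
typedef bool (*load_fn)(void *table);

static bool load_ok(void *table)   { (void)table; return true; }
static bool load_fail(void *table) { (void)table; return false; }

/*
 * Returns the buffer that is now owned by the consumer, or NULL if
 * allocation or loading failed (in the latter case the buffer is
 * released here).
 */
static void *load_table(load_fn load, size_t size)
{
	void *table = malloc(size);

	if (!table)
		return NULL;
	if (load(table))
		return table;	/* success: consumer owns it, do not free */
	free(table);		/* failure: nobody else holds a reference */
	return NULL;
}
```

Freeing on the success path as well would leave the consumer with a
dangling pointer, which is the use-after-free pattern described above.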
Cc: <stable@vger.kernel.org> # v6.0
Link: https://lore.kernel.org/all/a101a10a-4fbb-5fae-2e3c-76cf96ed8fbd@linux.intel.com/
Fixes:
3881ee0b1edc ("efi: avoid efivars layer when loading SSDTs from variables")
Reported-by: Amadeusz Sławiński <amadeuszx.slawinski@linux.intel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Ard Biesheuvel [Mon, 17 Oct 2022 10:48:46 +0000 (12:48 +0200)]
efi: libstub: Remove zboot signing from build options
The zboot decompressor series introduced a feature to sign the PE/COFF
kernel image for secure boot as part of the kernel build. This was
necessary because there are actually two images that need to be signed:
the kernel with the EFI stub attached, and the decompressor application.
This is a bit of a burden, because it means that the images must be
signed on the same system that performs the build, and this is not
realistic for distros.
During the next cycle, we will introduce changes to the zboot code so
that the inner image no longer needs to be signed. This means that the
outer PE/COFF image can be handled as usual, and be signed later in the
release process.
Let's remove the associated Kconfig options now so that they don't end
up in an LTS release while already being deprecated.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Jerry Snitselaar [Wed, 19 Oct 2022 00:44:47 +0000 (08:44 +0800)]
iommu/vt-d: Clean up si_domain in the init_dmars() error path
A splat from kmem_cache_destroy() was seen with a kernel prior to
commit
ee2653bbe89d ("iommu/vt-d: Remove domain and devinfo mempool")
when there was a failure in init_dmars(), because the iommu_domain
cache still had objects. While the mempool code is now gone, there
still is a leak of the si_domain memory if init_dmars() fails. So
clean up si_domain in the init_dmars() error path.
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Will Deacon <will@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Fixes:
86080ccc223a ("iommu/vt-d: Allocate si_domain in init_dmars()")
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
Link: https://lore.kernel.org/r/20221010144842.308890-1-jsnitsel@redhat.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Charlotte Tan [Wed, 19 Oct 2022 00:44:46 +0000 (08:44 +0800)]
iommu/vt-d: Allow NVS regions in arch_rmrr_sanity_check()
arch_rmrr_sanity_check() warns if the RMRR is not covered by an ACPI
Reserved region, but it seems like it should accept an NVS region as
well. The ACPI spec
https://uefi.org/specs/ACPI/6.5/15_System_Address_Map_Interfaces.html
uses similar wording for "Reserved" and "NVS" region types; for NVS
regions it says "This range of addresses is in use or reserved by the
system and must not be used by the operating system."
There is an old comment on this mailing list that also suggests NVS
regions should pass the arch_rmrr_sanity_check() test:
The warnings come from arch_rmrr_sanity_check() since it checks whether
the region is E820_TYPE_RESERVED. However, if the purpose of the check
is to detect RMRR has regions that may be used by OS as free memory,
isn't E820_TYPE_NVS safe, too?
This patch overlaps with another proposed patch that would add the region
type to the log since sometimes the bug reporter sees this log on the
console but doesn't know to include the kernel log:
https://lore.kernel.org/lkml/20220611204859.234975-3-atomlin@redhat.com/
Here's an example of the "Firmware Bug" apparent false positive (wrapped
for line length):
DMAR: [Firmware Bug]: No firmware reserved region can cover this RMRR
[0x000000006f760000-0x000000006f762fff], contact BIOS vendor for
fixes
DMAR: [Firmware Bug]: Your BIOS is broken; bad RMRR
[0x000000006f760000-0x000000006f762fff]
This is the snippet from the e820 table:
BIOS-e820: [mem 0x0000000068bff000-0x000000006ebfefff] reserved
BIOS-e820: [mem 0x000000006ebff000-0x000000006f9fefff] ACPI NVS
BIOS-e820: [mem 0x000000006f9ff000-0x000000006fffefff] ACPI data
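As a worked example, the coverage check can be modelled with the
addresses quoted above. This is a simplified sketch, not the kernel's
implementation: the types and rmrr_covered() are hypothetical, and it
only accepts an RMRR that fits inside a single map entry.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical region types standing in for the e820 types. */
enum region_type { TYPE_RESERVED, TYPE_NVS, TYPE_ACPI_DATA };

struct mem_region {
	uint64_t start;
	uint64_t end;	/* inclusive */
	enum region_type type;
};

/* True if [start, end] lies entirely inside one region of an accepted type. */
static bool rmrr_covered(const struct mem_region *map, size_t n,
			 uint64_t start, uint64_t end, bool accept_nvs)
{
	for (size_t i = 0; i < n; i++) {
		bool type_ok = map[i].type == TYPE_RESERVED ||
			       (accept_nvs && map[i].type == TYPE_NVS);

		if (type_ok && start >= map[i].start && end <= map[i].end)
			return true;
	}
	return false;
}

/* The three e820 entries quoted above. */
static const struct mem_region example_map[] = {
	{ 0x68bff000ULL, 0x6ebfefffULL, TYPE_RESERVED },
	{ 0x6ebff000ULL, 0x6f9fefffULL, TYPE_NVS },
	{ 0x6f9ff000ULL, 0x6fffefffULL, TYPE_ACPI_DATA },
};
```

The RMRR [0x6f760000-0x6f762fff] falls inside the ACPI NVS entry, so it
is rejected when only reserved regions are accepted and passes once NVS
is accepted too.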
Fixes:
f036c7fa0ab6 ("iommu/vt-d: Check VT-d RMRR region in BIOS is reported as reserved")
Cc: Will Mortensen <will@extrahop.com>
Link: https://lore.kernel.org/linux-iommu/64a5843d-850d-e58c-4fc2-0a0eeeb656dc@nec.com/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216443
Signed-off-by: Charlotte Tan <charlotte@extrahop.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Link: https://lore.kernel.org/r/20220929044449.32515-1-charlotte@extrahop.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Lu Baolu [Wed, 19 Oct 2022 00:44:45 +0000 (08:44 +0800)]
iommu/vt-d: Use rcu_lock in get_resv_regions
Commit
5f64ce5411b46 ("iommu/vt-d: Duplicate iommu_resv_region objects
per device list") converted rcu_lock in get_resv_regions to
dmar_global_lock to allow sleeping in iommu_alloc_resv_region(). This
introduced possible recursive locking if get_resv_regions is called from
within a section where intel_iommu_init() already holds dmar_global_lock.
In particular, after commit
57365a04c921 ("iommu: Move bus setup to IOMMU
device registration"), the lockdep splat below can always be seen.
============================================
WARNING: possible recursive locking detected
6.0.0-rc4+ #325 Tainted: G I
--------------------------------------------
swapper/0/1 is trying to acquire lock:
ffffffffa8a18c90 (dmar_global_lock){++++}-{3:3}, at:
intel_iommu_get_resv_regions+0x25/0x270
but task is already holding lock:
ffffffffa8a18c90 (dmar_global_lock){++++}-{3:3}, at:
intel_iommu_init+0x36d/0x6ea
...
Call Trace:
<TASK>
dump_stack_lvl+0x48/0x5f
__lock_acquire.cold.73+0xad/0x2bb
lock_acquire+0xc2/0x2e0
? intel_iommu_get_resv_regions+0x25/0x270
? lock_is_held_type+0x9d/0x110
down_read+0x42/0x150
? intel_iommu_get_resv_regions+0x25/0x270
intel_iommu_get_resv_regions+0x25/0x270
iommu_create_device_direct_mappings.isra.28+0x8d/0x1c0
? iommu_get_dma_cookie+0x6d/0x90
bus_iommu_probe+0x19f/0x2e0
iommu_device_register+0xd4/0x130
intel_iommu_init+0x3e1/0x6ea
? iommu_setup+0x289/0x289
? rdinit_setup+0x34/0x34
pci_iommu_init+0x12/0x3a
do_one_initcall+0x65/0x320
? rdinit_setup+0x34/0x34
? rcu_read_lock_sched_held+0x5a/0x80
kernel_init_freeable+0x28a/0x2f3
? rest_init+0x1b0/0x1b0
kernel_init+0x1a/0x130
ret_from_fork+0x1f/0x30
</TASK>
This rolls back dmar_global_lock to rcu_lock in get_resv_regions to avoid
the lockdep splat.
Fixes:
57365a04c921 ("iommu: Move bus setup to IOMMU device registration")
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20220927053109.4053662-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Lu Baolu [Wed, 19 Oct 2022 00:44:44 +0000 (08:44 +0800)]
iommu: Add gfp parameter to iommu_alloc_resv_region
Add gfp parameter to iommu_alloc_resv_region() for the callers to specify
the memory allocation behavior. Thus iommu_alloc_resv_region() could also
be available in critical contexts.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20220927053109.4053662-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Anup Patel [Fri, 21 Oct 2022 06:22:45 +0000 (11:52 +0530)]
RISC-V: KVM: Fix kvm_riscv_vcpu_timer_pending() for Sstc
kvm_riscv_vcpu_timer_pending() checks the per-VCPU next_cycles
and the per-VCPU software-injected VS timer interrupt. This function
returns an incorrect value when Sstc is available because the per-VCPU
next_cycles are only updated by kvm_riscv_vcpu_timer_save() called
from kvm_arch_vcpu_put(). As a result, when Sstc is available the
VCPU does not block properly upon WFI traps.
To fix the above issue, we introduce kvm_riscv_vcpu_timer_sync()
which will update per-VCPU next_cycles upon every VM exit instead
of kvm_riscv_vcpu_timer_save().
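The stale-copy problem can be illustrated with a small model. This is a
sketch under assumptions, not the KVM code: 'struct vcpu_timer',
timer_sync() and timer_pending() are hypothetical, and the software
injected interrupt is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of the per-VCPU timer state. */
struct vcpu_timer {
	uint64_t hw_compare;	/* value the guest programmed directly (Sstc) */
	uint64_t next_cycles;	/* software copy consulted by the pending check */
};

/* Refresh the software copy; per the text this runs on every VM exit. */
static void timer_sync(struct vcpu_timer *t)
{
	t->next_cycles = t->hw_compare;
}

/* The timer is pending once the current cycle count reaches next_cycles. */
static bool timer_pending(const struct vcpu_timer *t, uint64_t now)
{
	return now >= t->next_cycles;
}
```

If the copy is only refreshed when the VCPU is put, a pending check made
in between sees an old deadline and can wrongly report a pending timer,
so the VCPU does not block.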
Fixes:
8f5cb44b1bae ("RISC-V: KVM: Support sstc extension")
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Anup Patel <anup@brainfault.org>
Andrew Jones [Fri, 21 Oct 2022 06:22:39 +0000 (11:52 +0530)]
RISC-V: Fix compilation without RISCV_ISA_ZICBOM
riscv_cbom_block_size and riscv_init_cbom_blocksize() should always
be available and riscv_init_cbom_blocksize() should always be
invoked, even when compiling without RISCV_ISA_ZICBOM enabled. This
is because disabling RISCV_ISA_ZICBOM means "don't use zicbom
instructions in the kernel" not "pretend there isn't zicbom, even
when there is". When zicbom is available, whether the kernel enables
its use with RISCV_ISA_ZICBOM or not, KVM will offer it to guests.
Ensure we can build KVM and that the block size is initialized even
when compiling without RISCV_ISA_ZICBOM.
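The intended split between "available" and "used by the kernel" can be
sketched as follows. USE_ZICBOM, cbom_block_size,
init_cbom_blocksize() and kernel_uses_cbom() are hypothetical stand-ins
for the Kconfig option and the kernel symbols.

```c
#include <stdbool.h>

/* Hypothetical build-time switch standing in for the Kconfig option. */
/* #define USE_ZICBOM 1 */

/* Always defined, regardless of the switch. */
static unsigned int cbom_block_size;

/* Always built and always called, so other users see the probed size. */
static void init_cbom_blocksize(unsigned int probed)
{
	cbom_block_size = probed;
}

/* Only the kernel's own use of the instructions is gated by the switch. */
static bool kernel_uses_cbom(void)
{
#ifdef USE_ZICBOM
	return cbom_block_size != 0;
#else
	return false;
#endif
}
```

With the switch off, the block size is still initialized for consumers
such as KVM, while the kernel itself reports that it does not use the
instructions.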
Fixes:
8f7e001e0325 ("RISC-V: Clean up the Zicbom block size probing")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Heiko Stuebner <heiko@sntech.de>
Tested-by: Heiko Stuebner <heiko@sntech.de>
Signed-off-by: Anup Patel <anup@brainfault.org>
Adam Borowski [Mon, 10 Oct 2022 18:33:51 +0000 (20:33 +0200)]
i2c: mlxbf: depend on ACPI; clean away ifdeffage
This fixes maybe_unused warnings/errors.
According to a comment during device tree removal, only ACPI is supported,
thus let's actually require it.
Fixes:
be18c5ede25d ("i2c: mlxbf: remove device tree support")
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Signed-off-by: Wolfram Sang <wsa@kernel.org>