platform/kernel/linux-rpi.git
3 years agoMerge tag 'nvme-5.15-2021-10-28' of git://git.infradead.org/nvme into block-5.15
Jens Axboe [Thu, 28 Oct 2021 14:34:01 +0000 (08:34 -0600)]
Merge tag 'nvme-5.15-2021-10-28' of git://git.infradead.org/nvme into block-5.15

Pull NVMe fixes from Christoph:

"nvme fixe for Linux 5.15

 - fix nvmet-tcp header digest verification (Amit Engel)
 - fix a memory leak in nvmet-tcp when releasing a queue
   (Maurizio Lombardi)
 - fix nvme-tcp H2CData PDU send accounting again (Sagi Grimberg)
 - fix digest pointer calculation in nvme-tcp and nvmet-tcp
   (Varun Prakash)
 - fix possible nvme-tcp req->offset corruption (Varun Prakash)"

* tag 'nvme-5.15-2021-10-28' of git://git.infradead.org/nvme:
  nvmet-tcp: fix header digest verification
  nvmet-tcp: fix data digest pointer calculation
  nvme-tcp: fix data digest pointer calculation
  nvme-tcp: fix possible req->offset corruption
  nvme-tcp: fix H2CData PDU send accounting (again)
  nvmet-tcp: fix a memory leak when releasing a queue

3 years agoblock: Fix partition check for host-aware zoned block devices
Shin'ichiro Kawasaki [Tue, 26 Oct 2021 06:01:15 +0000 (15:01 +0900)]
block: Fix partition check for host-aware zoned block devices

Commit a33df75c6328 ("block: use an xarray for disk->part_tbl") modified
the method to check partition existence in host-aware zoned block
devices from disk_has_partitions() helper function call to empty check
of xarray disk->part_tbl. However, disk->part_tbl always has single
entry for disk->part0 and never becomes empty. This resulted in the
host-aware zoned devices always judged to have partitions, and it made
the sysfs queue/zoned attribute to be "none" instead of "host-aware"
regardless of partition existence in the devices.

This also caused DEBUG_LOCKS_WARN_ON(lock->magic != lock) for
sdkp->rev_mutex in scsi layer when the kernel detects host-aware zoned
device. Since block layer handled the host-aware zoned devices as non-
zoned devices, scsi layer did not have chance to initialize the mutex
for zone revalidation. Therefore, the warning was triggered.

To fix the issues, call the helper function disk_has_partitions() in
place of disk->part_tbl empty check. Since the function was removed with
the commit a33df75c6328, reimplement it to walk through entries in the
xarray disk->part_tbl.

Fixes: a33df75c6328 ("block: use an xarray for disk->part_tbl")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org # v5.14+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211026060115.753746-1-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonvmet-tcp: fix header digest verification
Amit Engel [Wed, 27 Oct 2021 06:49:27 +0000 (09:49 +0300)]
nvmet-tcp: fix header digest verification

Pass the correct length to nvmet_tcp_verify_hdgst, which is the pdu
header length.  This fixes a wrong behaviour where header digest
verification passes although the digest is wrong.

Signed-off-by: Amit Engel <amit.engel@dell.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet-tcp: fix data digest pointer calculation
Varun Prakash [Mon, 25 Oct 2021 17:16:54 +0000 (22:46 +0530)]
nvmet-tcp: fix data digest pointer calculation

exp_ddgst is of type __le32, &cmd->exp_ddgst + cmd->offset increases
&cmd->exp_ddgst by 4 * cmd->offset, fix this by type casting
&cmd->exp_ddgst to u8 *.

Fixes: 872d26a391da ("nvmet-tcp: add NVMe over TCP target driver")
Signed-off-by: Varun Prakash <varun@chelsio.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme-tcp: fix data digest pointer calculation
Varun Prakash [Mon, 25 Oct 2021 17:17:30 +0000 (22:47 +0530)]
nvme-tcp: fix data digest pointer calculation

ddgst is of type __le32, &req->ddgst + req->offset
increases &req->ddgst by 4 * req->offset, fix this by
type casting &req->ddgst to u8 *.

Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
Signed-off-by: Varun Prakash <varun@chelsio.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme-tcp: fix possible req->offset corruption
Varun Prakash [Tue, 26 Oct 2021 13:31:55 +0000 (19:01 +0530)]
nvme-tcp: fix possible req->offset corruption

With commit db5ad6b7f8cd ("nvme-tcp: try to send request in queue_rq
context") r2t and response PDU can get processed while send function
is executing.

Current data digest send code uses req->offset after kernel_sendmsg(),
this creates a race condition where req->offset gets reset before it
is used in send function.

This can happen in two cases -
1. Target sends r2t PDU which resets req->offset.
2. Target send response PDU which completes the req and then req is
   used for a new command, nvme_tcp_setup_cmd_pdu() resets req->offset.

Fix this by storing req->offset in a local variable and using
this local variable after kernel_sendmsg().

Fixes: db5ad6b7f8cd ("nvme-tcp: try to send request in queue_rq context")
Signed-off-by: Varun Prakash <varun@chelsio.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agoblock: schedule queue restart after BLK_STS_ZONE_RESOURCE
Naohiro Aota [Tue, 26 Oct 2021 16:51:27 +0000 (01:51 +0900)]
block: schedule queue restart after BLK_STS_ZONE_RESOURCE

When dispatching a zone append write request to a SCSI zoned block device,
if the target zone of the request is already locked, the device driver will
return BLK_STS_ZONE_RESOURCE and the request will be pushed back to the
hctx dipatch queue. The queue will be marked as RESTART in
dd_finish_request() and restarted in __blk_mq_free_request(). However, this
restart applies to the hctx of the completed request. If the requeued
request is on a different hctx, dispatch will no be retried until another
request is submitted or the next periodic queue run triggers, leading to up
to 30 seconds latency for the requeued request.

Fix this problem by scheduling a queue restart similarly to the
BLK_STS_RESOURCE case or when we cannot get the budget.

Also, consolidate the checks into the "need_resource" variable to simplify
the condition.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Link: https://lore.kernel.org/r/20211026165127.4151055-1-naohiro.aota@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: drain queue after disk is removed from sysfs
Ming Lei [Tue, 26 Oct 2021 10:12:04 +0000 (18:12 +0800)]
block: drain queue after disk is removed from sysfs

Before removing disk from sysfs, userspace still may change queue via
sysfs, such as switching elevator or setting wbt latency, both may
reinitialize wbt, then the warning in blk_free_queue_stats() will be
triggered since rq_qos_exit() is moved to del_gendisk().

Fixes the issue by moving draining queue & tearing down after disk is
removed from sysfs, at that time no one can come into queue's
store()/show().

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: 8e141f9eb803 ("block: drain file system I/O on del_gendisk")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211026101204.2897166-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonvme-tcp: fix H2CData PDU send accounting (again)
Sagi Grimberg [Sun, 24 Oct 2021 07:43:31 +0000 (10:43 +0300)]
nvme-tcp: fix H2CData PDU send accounting (again)

We should not access request members after the last send, even to
determine if indeed it was the last data payload send. The reason is
that a completion could have arrived and trigger a new execution of the
request which overridden these members. This was fixed by commit
825619b09ad3 ("nvme-tcp: fix possible use-after-completion").

Commit e371af033c56 broke that assumption again to address cases where
multiple r2t pdus are sent per request. To fix it, we need to record the
request data_sent and data_len and after the payload network send we
reference these counters to determine weather we should advance the
request iterator.

Fixes: e371af033c56 ("nvme-tcp: fix incorrect h2cdata pdu offset accounting")
Reported-by: Keith Busch <kbusch@kernel.org>
Cc: stable@vger.kernel.org # 5.10+
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet-tcp: fix a memory leak when releasing a queue
Maurizio Lombardi [Fri, 15 Oct 2021 08:26:34 +0000 (10:26 +0200)]
nvmet-tcp: fix a memory leak when releasing a queue

page_frag_free() won't completely release the memory
allocated for the commands, the cache page must be explicitly
freed by calling __page_frag_cache_drain().

This bug can be easily reproduced by repeatedly
executing the following command on the initiator:

$echo 1 > /sys/devices/virtual/nvme-fabrics/ctl/nvme0/reset_controller

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: John Meneghini <jmeneghi@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agoblock: fix incorrect references to disk objects
Zqiang [Mon, 18 Oct 2021 10:34:22 +0000 (18:34 +0800)]
block: fix incorrect references to disk objects

When adding partitions to the disk, the reference count of the disk
object is increased. then alloc partition device and called
device_add(), if the device_add() return error, the reference
count of the disk object will be reduced twice, at put_device(pdev)
and put_disk(disk). this leads to the end of the object's life cycle
prematurely, and trigger following calltrace.

  __init_work+0x2d/0x50 kernel/workqueue.c:519
  synchronize_rcu_expedited+0x3af/0x650 kernel/rcu/tree_exp.h:847
  bdi_remove_from_list mm/backing-dev.c:938 [inline]
  bdi_unregister+0x17f/0x5c0 mm/backing-dev.c:946
  release_bdi+0xa1/0xc0 mm/backing-dev.c:968
  kref_put include/linux/kref.h:65 [inline]
  bdi_put+0x72/0xa0 mm/backing-dev.c:976
  bdev_free_inode+0x11e/0x220 block/bdev.c:408
  i_callback+0x3f/0x70 fs/inode.c:226
  rcu_do_batch kernel/rcu/tree.c:2508 [inline]
  rcu_core+0x76d/0x16c0 kernel/rcu/tree.c:2743
  __do_softirq+0x1d7/0x93b kernel/softirq.c:558
  invoke_softirq kernel/softirq.c:432 [inline]
  __irq_exit_rcu kernel/softirq.c:636 [inline]
  irq_exit_rcu+0xf2/0x130 kernel/softirq.c:648
  sysvec_apic_timer_interrupt+0x93/0xc0

making disk is NULL when calling put_disk().

Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211018103422.2043-1-qiang.zhang1211@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-cgroup: blk_cgroup_bio_start() should use irq-safe operations on blkg->iostat_cpu
Tejun Heo [Thu, 14 Oct 2021 23:20:22 +0000 (13:20 -1000)]
blk-cgroup: blk_cgroup_bio_start() should use irq-safe operations on blkg->iostat_cpu

c3df5fb57fe8 ("cgroup: rstat: fix A-A deadlock on 32bit around
u64_stats_sync") made u64_stats updates irq-safe to avoid A-A deadlocks.
Unfortunately, the conversion missed one in blk_cgroup_bio_start(). Fix it.

Fixes: 2d146aa3aa84 ("mm: memcontrol: switch to rstat")
Cc: stable@vger.kernel.org # v5.13+
Reported-by: syzbot+9738c8815b375ce482a1@syzkaller.appspotmail.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/YWi7NrQdVlxD6J9W@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock, bfq: reset last_bfqq_created on group change
Paolo Valente [Fri, 15 Oct 2021 14:43:36 +0000 (16:43 +0200)]
block, bfq: reset last_bfqq_created on group change

Since commit 430a67f9d616 ("block, bfq: merge bursts of newly-created
queues"), BFQ maintains a per-group pointer to the last bfq_queue
created. If such a queue, say bfqq, happens to move to a different
group, then bfqq is no more a valid last bfq_queue created for its
previous group. That pointer must then be cleared. Not resetting such
a pointer may also cause UAF, if bfqq happens to also be freed after
being moved to a different group. This commit performs this missing
reset. As such it fixes commit 430a67f9d616 ("block, bfq: merge bursts
of newly-created queues").

Such a missing reset is most likely the cause of the crash reported in [1].
With some analysis, we found that this crash was due to the
above UAF. And such UAF did go away with this commit applied [1].

Anyway, before this commit, that crash happened to be triggered in
conjunction with commit 2d52c58b9c9b ("block, bfq: honor already-setup
queue merges"). The latter was then reverted by commit ebc69e897e17
("Revert "block, bfq: honor already-setup queue merges""). Yet commit
2d52c58b9c9b ("block, bfq: honor already-setup queue merges") contains
no error related with the above UAF, and can then be restored.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=214503

Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues")
Tested-by: Grzegorz Kowal <custos.mentis@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20211015144336.45894-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: warn when putting the final reference on a registered disk
Christoph Hellwig [Thu, 14 Oct 2021 13:02:31 +0000 (15:02 +0200)]
block: warn when putting the final reference on a registered disk

Warn when the last reference on a live disk is put without calling
del_gendisk first.  There are some BDI related bug reports that look
like a case of this, so make sure we have the proper instrumentation
to catch it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014130231.1468538-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agobrd: reduce the brd_devices_mutex scope
Tetsuo Handa [Tue, 7 Sep 2021 10:18:00 +0000 (19:18 +0900)]
brd: reduce the brd_devices_mutex scope

As with commit 8b52d8be86d72308 ("loop: reorder loop_exit"),
unregister_blkdev() needs to be called first in order to avoid calling
brd_alloc() from brd_probe() after brd_del_one() from brd_exit(). Then,
we can avoid holding global mutex during add_disk()/del_gendisk() as with
commit 1c500ad706383f1a ("loop: reduce the loop_ctl_mutex scope").

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/e205f13d-18ff-a49c-0988-7de6ea5ff823@i-love.sakura.ne.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agokyber: avoid q->disk dereferences in trace points
Christoph Hellwig [Tue, 12 Oct 2021 09:33:01 +0000 (11:33 +0200)]
kyber: avoid q->disk dereferences in trace points

q->disk becomes invalid after the gendisk is removed.  Work around this
by caching the dev_t for the tracepoints.  The real fix would be to
properly tear down the I/O schedulers with the gendisk, but that is
a much more invasive change.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012093301.GA27795@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: keep q_usage_counter in atomic mode after del_gendisk
Christoph Hellwig [Wed, 29 Sep 2021 07:12:41 +0000 (09:12 +0200)]
block: keep q_usage_counter in atomic mode after del_gendisk

Don't switch back to percpu mode to avoid the double RCU grace period
when tearing down SCSI devices.  After removing the disk only passthrough
commands can be send anyway.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-6-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: drain file system I/O on del_gendisk
Christoph Hellwig [Wed, 29 Sep 2021 07:12:40 +0000 (09:12 +0200)]
block: drain file system I/O on del_gendisk

Instead of delaying draining of file system I/O related items like the
blk-qos queues, the integrity read workqueue and timeouts only when the
request_queue is removed, do that when del_gendisk is called.  This is
important for SCSI where the upper level drivers that control the gendisk
are separate entities, and the disk can be freed much earlier than the
request_queue, or can even be unbound without tearing down the queue.

Fixes: edb0872f44ec ("block: move the bdi from the request_queue to the gendisk")
Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-5-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: split bio_queue_enter from blk_queue_enter
Christoph Hellwig [Wed, 29 Sep 2021 07:12:39 +0000 (09:12 +0200)]
block: split bio_queue_enter from blk_queue_enter

To prepare for fixing a gendisk shutdown race, open code the
blk_queue_enter logic in bio_queue_enter.  This also removes the
pointless flags translation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-4-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: factor out a blk_try_enter_queue helper
Christoph Hellwig [Wed, 29 Sep 2021 07:12:38 +0000 (09:12 +0200)]
block: factor out a blk_try_enter_queue helper

Factor out the code to try to get q_usage_counter without blocking into
a separate helper.  Both to improve code readability and to prepare for
splitting bio_queue_enter from blk_queue_enter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-3-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: call submit_bio_checks under q_usage_counter
Christoph Hellwig [Wed, 29 Sep 2021 07:12:37 +0000 (09:12 +0200)]
block: call submit_bio_checks under q_usage_counter

Ensure all bios check the current values of the queue under freeze
protection, i.e. to make sure the zero capacity set by del_gendisk
is actually seen before dispatching to the driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210929071241.934472-2-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoMerge tag 'nvme-5.15-2021-10-14' of git://git.infradead.org/nvme into block-5.15
Jens Axboe [Thu, 14 Oct 2021 15:07:14 +0000 (09:07 -0600)]
Merge tag 'nvme-5.15-2021-10-14' of git://git.infradead.org/nvme into block-5.15

Pull NVMe fixes from Christoph:

"nvme fixes for Linux 5.15:

 - fix the abort command id (Keith Busch)
 - nvme: fix per-namespace chardev deletion (Adam Manzanares)"

* tag 'nvme-5.15-2021-10-14' of git://git.infradead.org/nvme:
  nvme: fix per-namespace chardev deletion
  nvme-pci: Fix abort command id

3 years agonvme: fix per-namespace chardev deletion
Adam Manzanares [Wed, 13 Oct 2021 15:04:19 +0000 (15:04 +0000)]
nvme: fix per-namespace chardev deletion

Decrease reference count of chardevice during char device deletion in
order to fix a memory leak.  Add a release callabck for the device
associated chardev and move ida_simple_remove into the release function.

Fixes: 2637baed7801 ("nvme: introduce generic per-namespace chardev")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Suggested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Javier González <javier@javigon.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agoblock/rnbd-clt-sysfs: fix a couple uninitialized variable bugs
Dan Carpenter [Tue, 12 Oct 2021 08:44:43 +0000 (11:44 +0300)]
block/rnbd-clt-sysfs: fix a couple uninitialized variable bugs

These variables are printed on the error path if match_int() fails so
they have to be initialized.

Fixes: 2958a995edc9 ("block/rnbd-clt: Support polling mode for IO latency optimization")
Fixes: 1eb54f8f5dd8 ("block/rnbd: client: sysfs interface functions")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Gioh Kim <gi-oh.kim@ionos.com>
Link: https://lore.kernel.org/r/20211012084443.GA31472@kili
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonvme-pci: Fix abort command id
Keith Busch [Thu, 7 Oct 2021 06:50:31 +0000 (23:50 -0700)]
nvme-pci: Fix abort command id

The request tag is no longer the only component of the command id.

Fixes: e7006de6c2380 ("nvme: code command_id with a genctr for use-after-free validation")
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
3 years agoblock: decode QUEUE_FLAG_HCTX_ACTIVE in debugfs output
Johannes Thumshirn [Mon, 4 Oct 2021 07:22:07 +0000 (16:22 +0900)]
block: decode QUEUE_FLAG_HCTX_ACTIVE in debugfs output

While debugging an issue we've found that $DEBUGFS/block/$disk/state
doesn't decode QUEUE_FLAG_HCTX_ACTIVE but only displays its numerical
value.

Add QUEUE_FLAG(HCTX_ACTIVE) to the blk_queue_flag_name array so it'll get
decoded properly.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/4351076388918075bd80ef07756f9d2ce63be12c.1633332053.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: genhd: fix double kfree() in __alloc_disk_node()
Tetsuo Handa [Sat, 2 Oct 2021 09:23:02 +0000 (18:23 +0900)]
block: genhd: fix double kfree() in __alloc_disk_node()

syzbot is reporting use-after-free read at bdev_free_inode() [1], for
kfree() from __alloc_disk_node() is called before bdev_free_inode()
(which is called after RCU grace period) reads bdev->bd_disk and calls
kfree(bdev->bd_disk).

Fix use-after-free read followed by double kfree() problem
by making sure that bdev->bd_disk is NULL when calling iput().

Link: https://syzkaller.appspot.com/bug?extid=8281086e8a6fbfbd952a
Reported-by: syzbot <syzbot+8281086e8a6fbfbd952a@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/e6dd13c5-8db0-4392-6e78-a42ee5d2a1c4@i-love.sakura.ne.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonbd: use shifts rather than multiplies
Nick Desaulniers [Mon, 20 Sep 2021 23:25:33 +0000 (16:25 -0700)]
nbd: use shifts rather than multiplies

commit fad7cd3310db ("nbd: add the check to prevent overflow in
__nbd_ioctl()") raised an issue from the fallback helpers added in
commit f0907827a8a9 ("compiler.h: enable builtin overflow checkers and
add fallback code")

ERROR: modpost: "__divdi3" [drivers/block/nbd.ko] undefined!

As Stephen Rothwell notes:
  The added check_mul_overflow() call is being passed 64 bit values.
  COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW is not set for this build (see
  include/linux/overflow.h).

Specifically, the helpers for checking whether the results of a
multiplication overflowed (__unsigned_mul_overflow,
__signed_add_overflow) use the division operator when
!COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW.  This is problematic for 64b
operands on 32b hosts.

This was fixed upstream by
commit 76ae847497bc ("Documentation: raise minimum supported version of
GCC to 5.1")
which is not suitable to be backported to stable.

Further, __builtin_mul_overflow() would emit a libcall to a
compiler-rt-only symbol when compiling with clang < 14 for 32b targets.

ld.lld: error: undefined symbol: __mulodi4

In order to keep stable buildable with GCC 4.9 and clang < 14, modify
struct nbd_config to instead track the number of bits of the block size;
reconstructing the block size using runtime checked shifts that are not
problematic for those compilers and in a ways that can be backported to
stable.

In nbd_set_size, we do validate that the value of blksize must be a
power of two (POT) and is in the range of [512, PAGE_SIZE] (both
inclusive).

This does modify the debugfs interface.

Cc: stable@vger.kernel.org
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Link: https://github.com/ClangBuiltLinux/linux/issues/1438
Link: https://lore.kernel.org/all/20210909182525.372ee687@canb.auug.org.au/
Link: https://lore.kernel.org/stable/CAHk-=whiQBofgis_rkniz8GBP9wZtSZdcDEffgSLO62BUGV3gg@mail.gmail.com/
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Suggested-by: Kees Cook <keescook@chromium.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210920232533.4092046-1-ndesaulniers@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoRevert "block, bfq: honor already-setup queue merges"
Jens Axboe [Tue, 28 Sep 2021 12:33:15 +0000 (06:33 -0600)]
Revert "block, bfq: honor already-setup queue merges"

This reverts commit 2d52c58b9c9bdae0ca3df6a1eab5745ab3f7d80b.

We have had several folks complain that this causes hangs for them, which
is especially problematic as the commit has also hit stable already.

As no resolution seems to be forthcoming right now, revert the patch.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=214503
Fixes: 2d52c58b9c9b ("block, bfq: honor already-setup queue merges")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonvme: add command id quirk for apple controllers
Keith Busch [Mon, 27 Sep 2021 15:43:06 +0000 (08:43 -0700)]
nvme: add command id quirk for apple controllers

Some apple controllers use the command id as an index to implementation
specific data structures and will fail if the value is out of bounds.
The nvme driver's recently introduced command sequence number breaks
this controller.

Provide a quirk so these spec incompliant controllers can function as
before. The driver will not have the ability to detect bad completions
when this quirk is used, but we weren't previously checking this anyway.

The quirk bit was selected so that it can readily apply to stable.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=214509
Cc: Sven Peter <sven@svenpeter.dev>
Reported-by: Orlando Chamberlain <redecorating@protonmail.com>
Reported-by: Aditya Garg <gargaditya08@live.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sven Peter <sven@svenpeter.dev>
Link: https://lore.kernel.org/r/20210927154306.387437-1-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: hold ->invalidate_lock in blkdev_fallocate
Ming Lei [Thu, 23 Sep 2021 02:37:51 +0000 (10:37 +0800)]
block: hold ->invalidate_lock in blkdev_fallocate

When running ->fallocate(), blkdev_fallocate() should hold
mapping->invalidate_lock to prevent page cache from being accessed,
otherwise stale data may be read in page cache.

Without this patch, blktests block/009 fails sometimes. With this patch,
block/009 can pass always.

Also as Jan pointed out, no pages can be created in the discarded area
while you are holding the invalidate_lock, so remove the 2nd
truncate_bdev_range().

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210923023751.1441091-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblktrace: Fix uaf in blk_trace access after removing by sysfs
Zhihao Cheng [Thu, 23 Sep 2021 13:49:21 +0000 (21:49 +0800)]
blktrace: Fix uaf in blk_trace access after removing by sysfs

There is an use-after-free problem triggered by following process:

      P1(sda) P2(sdb)
echo 0 > /sys/block/sdb/trace/enable
  blk_trace_remove_queue
    synchronize_rcu
    blk_trace_free
      relay_close
rcu_read_lock
__blk_add_trace
  trace_note_tsk
  (Iterate running_trace_list)
        relay_close_buf
  relay_destroy_buf
    kfree(buf)
    trace_note(sdb's bt)
      relay_reserve
        buf->offset <- nullptr deference (use-after-free) !!!
rcu_read_unlock

[  502.714379] BUG: kernel NULL pointer dereference, address:
0000000000000010
[  502.715260] #PF: supervisor read access in kernel mode
[  502.715903] #PF: error_code(0x0000) - not-present page
[  502.716546] PGD 103984067 P4D 103984067 PUD 17592b067 PMD 0
[  502.717252] Oops: 0000 [#1] SMP
[  502.720308] RIP: 0010:trace_note.isra.0+0x86/0x360
[  502.732872] Call Trace:
[  502.733193]  __blk_add_trace.cold+0x137/0x1a3
[  502.733734]  blk_add_trace_rq+0x7b/0xd0
[  502.734207]  blk_add_trace_rq_issue+0x54/0xa0
[  502.734755]  blk_mq_start_request+0xde/0x1b0
[  502.735287]  scsi_queue_rq+0x528/0x1140
...
[  502.742704]  sg_new_write.isra.0+0x16e/0x3e0
[  502.747501]  sg_ioctl+0x466/0x1100

Reproduce method:
  ioctl(/dev/sda, BLKTRACESETUP, blk_user_trace_setup[buf_size=127])
  ioctl(/dev/sda, BLKTRACESTART)
  ioctl(/dev/sdb, BLKTRACESETUP, blk_user_trace_setup[buf_size=127])
  ioctl(/dev/sdb, BLKTRACESTART)

  echo 0 > /sys/block/sdb/trace/enable &
  // Add delay(mdelay/msleep) before kernel enters blk_trace_free()

  ioctl$SG_IO(/dev/sda, SG_IO, ...)
  // Enters trace_note_tsk() after blk_trace_free() returned
  // Use mdelay in rcu region rather than msleep(which may schedule out)

Remove blk_trace from running_list before calling blk_trace_free() by
sysfs if blk_trace is at Blktrace_running state.

Fixes: c71a896154119f ("blktrace: add ftrace plugin")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Link: https://lore.kernel.org/r/20210923134921.109194-1-chengzhihao1@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: don't call rq_qos_ops->done_bio if the bio isn't tracked
Ming Lei [Fri, 24 Sep 2021 11:07:04 +0000 (19:07 +0800)]
block: don't call rq_qos_ops->done_bio if the bio isn't tracked

rq_qos framework is only applied on request based driver, so:

1) rq_qos_done_bio() needn't to be called for bio based driver

2) rq_qos_done_bio() needn't to be called for bio which isn't tracked,
such as bios ended from error handling code.

Especially in bio_endio():

1) request queue is referred via bio->bi_bdev->bd_disk->queue, which
may be gone since request queue refcount may not be held in above two
cases

2) q->rq_qos may be freed in blk_cleanup_queue() when calling into
__rq_qos_done_bio()

Fix the potential kernel panic by not calling rq_qos_ops->done_bio if
the bio isn't tracked. This way is safe because both ioc_rqos_done_bio()
and blkcg_iolatency_done_bio() are nop if the bio isn't tracked.

Reported-by: Yu Kuai <yukuai3@huawei.com>
Cc: tj@kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210924110704.1541818-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoMerge tag 'nvme-5.15-2021-09-24' of git://git.infradead.org/nvme into block-5.15
Jens Axboe [Fri, 24 Sep 2021 13:15:21 +0000 (07:15 -0600)]
Merge tag 'nvme-5.15-2021-09-24' of git://git.infradead.org/nvme into block-5.15

Pull NVMe fixes from Christoph:

"nvme fixes for Linux 5.15:

 - keep ctrl->namespaces ordered (me)
 - fix incorrect h2cdata pdu offset accounting in nvme-tcp
   (Sagi Grimberg)
 - handled updated hw_queues in nvme-fc more carefully (Daniel Wagner,
   James Smart)"

* tag 'nvme-5.15-2021-09-24' of git://git.infradead.org/nvme:
  nvme: keep ctrl->namespaces ordered
  nvme-tcp: fix incorrect h2cdata pdu offset accounting
  nvme-fc: remove freeze/unfreeze around update_nr_hw_queues
  nvme-fc: avoid race between time out and tear down
  nvme-fc: update hardware queues before using them

3 years agoMerge branch 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md...
Jens Axboe [Wed, 22 Sep 2021 18:43:18 +0000 (12:43 -0600)]
Merge branch 'md-fixes' of https://git./linux/kernel/git/song/md into block-5.15

Pull MD fix from Song.

* 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md: fix a lock order reversal in md_alloc

3 years agomd: fix a lock order reversal in md_alloc
Christoph Hellwig [Wed, 1 Sep 2021 11:38:29 +0000 (13:38 +0200)]
md: fix a lock order reversal in md_alloc

Commit b0140891a8cea3 ("md: Fix race when creating a new md device.")
not only moved assigning mddev->gendisk before calling add_disk, which
fixes the races described in the commit log, but also added a
mddev->open_mutex critical section over add_disk and creation of the
md kobj.  Adding a kobject after add_disk is racy vs deleting the gendisk
right after adding it, but md already prevents against that by holding
a mddev->active reference.

On the other hand taking this lock added a lock order reversal with what
is not disk->open_mutex (used to be bdev->bd_mutex when the commit was
added) for partition devices, which need that lock for the internal open
for the partition scan, and a recent commit also takes it for
non-partitioned devices, leading to further lockdep splatter.

Fixes: b0140891a8ce ("md: Fix race when creating a new md device.")
Fixes: d62633873590 ("block: support delayed holder registration")
Reported-by: syzbot+fadc0aaf497e6a493b9f@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: syzbot+fadc0aaf497e6a493b9f@syzkaller.appspotmail.com
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
3 years agonvme: keep ctrl->namespaces ordered
Christoph Hellwig [Tue, 14 Sep 2021 06:38:20 +0000 (08:38 +0200)]
nvme: keep ctrl->namespaces ordered

Various places in the nvme code that rely on ctrl->namespace to be
ordered.  Ensure that the namespae is inserted into the list at the
right position from the start instead of sorting it after the fact.

Fixes: 540c801c65eb ("NVMe: Implement namespace list scanning")
Reported-by: Anton Eidelman <anton.eidelman@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
3 years agonvme-tcp: fix incorrect h2cdata pdu offset accounting
Sagi Grimberg [Tue, 14 Sep 2021 15:38:55 +0000 (18:38 +0300)]
nvme-tcp: fix incorrect h2cdata pdu offset accounting

When the controller sends us multiple r2t PDUs in a single
request we need to account for it correctly as our send/recv
context run concurrently (i.e. we get a new r2t with r2t_offset
before we updated our iterator and req->data_sent marker). This
can cause wrong offsets to be sent to the controller.

To fix that, we will first know that this may happen only in
the send sequence of the last page, hence we will take
the r2t_offset to the h2c PDU data_offset, and in
nvme_tcp_try_send_data loop, we make sure to increment
the request markers also when we completed a PDU but
we are expecting more r2t PDUs as we still did not send
the entire data of the request.

Fixes: 825619b09ad3 ("nvme-tcp: fix possible use-after-completion")
Reported-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
Tested-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme-fc: remove freeze/unfreeze around update_nr_hw_queues
James Smart [Tue, 14 Sep 2021 09:20:08 +0000 (11:20 +0200)]
nvme-fc: remove freeze/unfreeze around update_nr_hw_queues

Remove the freeze/unfreeze around changes to the number of hardware
queues. Study and retest has indicated there are no ios that can be
active at this point so there is nothing to freeze.

nvme-fc is draining the queues in the shutdown and error recovery path
in __nvme_fc_abort_outstanding_ios.

This patch primarily reverts 88e837ed0f1f "nvme-fc: wait for queues to
freeze before calling update_hr_hw_queues". It's not an exact revert as
it leaves the adjusting of hw queues only if the count changes.

Signed-off-by: James Smart <jsmart2021@gmail.com>
[dwagner: added explanation why no IO is pending]
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme-fc: avoid race between time out and tear down
James Smart [Tue, 14 Sep 2021 09:20:07 +0000 (11:20 +0200)]
nvme-fc: avoid race between time out and tear down

To avoid race between time out and tear down, in tear down process,
first we quiesce the queue, and then delete the timer and cancel
the time out work for the queue.

This patch merges the admin and io sync ops into the queue teardown logic
as shown in the RDMA patch 3017013dcc "nvme-rdma: avoid race between time
out and tear down". There is no teardown_lock in nvme-fc.

Signed-off-by: James Smart <jsmart2021@gmail.com>
Tested-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme-fc: update hardware queues before using them
Daniel Wagner [Tue, 14 Sep 2021 09:20:06 +0000 (11:20 +0200)]
nvme-fc: update hardware queues before using them

In case the number of hardware queues changes, we need to update the
tagset and the mapping of ctx to hctx first.

If we try to create and connect the I/O queues first, this operation
will fail (target will reject the connect call due to the wrong number
of queues) and hence we bail out of the recreate function. Then we
will to try the very same operation again, thus we don't make any
progress.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agoblk-cgroup: fix UAF by grabbing blkcg lock before destroying blkg pd
Li Jinlin [Tue, 14 Sep 2021 04:26:05 +0000 (12:26 +0800)]
blk-cgroup: fix UAF by grabbing blkcg lock before destroying blkg pd

KASAN reports a use-after-free report when doing fuzz test:

[693354.104835] ==================================================================
[693354.105094] BUG: KASAN: use-after-free in bfq_io_set_weight_legacy+0xd3/0x160
[693354.105336] Read of size 4 at addr ffff888be0a35664 by task sh/1453338

[693354.105607] CPU: 41 PID: 1453338 Comm: sh Kdump: loaded Not tainted 4.18.0-147
[693354.105610] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.81 07/02/2018
[693354.105612] Call Trace:
[693354.105621]  dump_stack+0xf1/0x19b
[693354.105626]  ? show_regs_print_info+0x5/0x5
[693354.105634]  ? printk+0x9c/0xc3
[693354.105638]  ? cpumask_weight+0x1f/0x1f
[693354.105648]  print_address_description+0x70/0x360
[693354.105654]  kasan_report+0x1b2/0x330
[693354.105659]  ? bfq_io_set_weight_legacy+0xd3/0x160
[693354.105665]  ? bfq_io_set_weight_legacy+0xd3/0x160
[693354.105670]  bfq_io_set_weight_legacy+0xd3/0x160
[693354.105675]  ? bfq_cpd_init+0x20/0x20
[693354.105683]  cgroup_file_write+0x3aa/0x510
[693354.105693]  ? ___slab_alloc+0x507/0x540
[693354.105698]  ? cgroup_file_poll+0x60/0x60
[693354.105702]  ? 0xffffffff89600000
[693354.105708]  ? usercopy_abort+0x90/0x90
[693354.105716]  ? mutex_lock+0xef/0x180
[693354.105726]  kernfs_fop_write+0x1ab/0x280
[693354.105732]  ? cgroup_file_poll+0x60/0x60
[693354.105738]  vfs_write+0xe7/0x230
[693354.105744]  ksys_write+0xb0/0x140
[693354.105749]  ? __ia32_sys_read+0x50/0x50
[693354.105760]  do_syscall_64+0x112/0x370
[693354.105766]  ? syscall_return_slowpath+0x260/0x260
[693354.105772]  ? do_page_fault+0x9b/0x270
[693354.105779]  ? prepare_exit_to_usermode+0xf9/0x1a0
[693354.105784]  ? enter_from_user_mode+0x30/0x30
[693354.105793]  entry_SYSCALL_64_after_hwframe+0x65/0xca

[693354.105875] Allocated by task 1453337:
[693354.106001]  kasan_kmalloc+0xa0/0xd0
[693354.106006]  kmem_cache_alloc_node_trace+0x108/0x220
[693354.106010]  bfq_pd_alloc+0x96/0x120
[693354.106015]  blkcg_activate_policy+0x1b7/0x2b0
[693354.106020]  bfq_create_group_hierarchy+0x1e/0x80
[693354.106026]  bfq_init_queue+0x678/0x8c0
[693354.106031]  blk_mq_init_sched+0x1f8/0x460
[693354.106037]  elevator_switch_mq+0xe1/0x240
[693354.106041]  elevator_switch+0x25/0x40
[693354.106045]  elv_iosched_store+0x1a1/0x230
[693354.106049]  queue_attr_store+0x78/0xb0
[693354.106053]  kernfs_fop_write+0x1ab/0x280
[693354.106056]  vfs_write+0xe7/0x230
[693354.106060]  ksys_write+0xb0/0x140
[693354.106064]  do_syscall_64+0x112/0x370
[693354.106069]  entry_SYSCALL_64_after_hwframe+0x65/0xca

[693354.106114] Freed by task 1453336:
[693354.106225]  __kasan_slab_free+0x130/0x180
[693354.106229]  kfree+0x90/0x1b0
[693354.106233]  blkcg_deactivate_policy+0x12c/0x220
[693354.106238]  bfq_exit_queue+0xf5/0x110
[693354.106241]  blk_mq_exit_sched+0x104/0x130
[693354.106245]  __elevator_exit+0x45/0x60
[693354.106249]  elevator_switch_mq+0xd6/0x240
[693354.106253]  elevator_switch+0x25/0x40
[693354.106257]  elv_iosched_store+0x1a1/0x230
[693354.106261]  queue_attr_store+0x78/0xb0
[693354.106264]  kernfs_fop_write+0x1ab/0x280
[693354.106268]  vfs_write+0xe7/0x230
[693354.106271]  ksys_write+0xb0/0x140
[693354.106275]  do_syscall_64+0x112/0x370
[693354.106280]  entry_SYSCALL_64_after_hwframe+0x65/0xca

[693354.106329] The buggy address belongs to the object at ffff888be0a35580
                 which belongs to the cache kmalloc-1k of size 1024
[693354.106736] The buggy address is located 228 bytes inside of
                 1024-byte region [ffff888be0a35580ffff888be0a35980)
[693354.107114] The buggy address belongs to the page:
[693354.107273] page:ffffea002f828c00 count:1 mapcount:0 mapping:ffff888107c17080 index:0x0 compound_mapcount: 0
[693354.107606] flags: 0x17ffffc0008100(slab|head)
[693354.107760] raw: 0017ffffc0008100 ffffea002fcbc808 ffffea0030bd3a08 ffff888107c17080
[693354.108020] raw: 0000000000000000 00000000001c001c 00000001ffffffff 0000000000000000
[693354.108278] page dumped because: kasan: bad access detected

[693354.108511] Memory state around the buggy address:
[693354.108671]  ffff888be0a35500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[693354.116396]  ffff888be0a35580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[693354.124473] >ffff888be0a35600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[693354.132421]                                                        ^
[693354.140284]  ffff888be0a35680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[693354.147912]  ffff888be0a35700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[693354.155281] ==================================================================

blkgs are protected by both queue and blkcg locks and holding
either should stabilize them. However, the path of destroying
blkg policy data is only protected by queue lock in
blkcg_activate_policy()/blkcg_deactivate_policy(). Other tasks
can get the blkg policy data before the blkg policy data is
destroyed, and use it after destroyed, which will result in a
use-after-free.

CPU0                             CPU1
blkcg_deactivate_policy
  spin_lock_irq(&q->queue_lock)
                                 bfq_io_set_weight_legacy
                                   spin_lock_irq(&blkcg->lock)
                                   blkg_to_bfqg(blkg)
                                     pd_to_bfqg(blkg->pd[pol->plid])
                                     ^^^^^^blkg->pd[pol->plid] != NULL
                                           bfqg != NULL
  pol->pd_free_fn(blkg->pd[pol->plid])
    pd_to_bfqg(blkg->pd[pol->plid])
    bfqg_put(bfqg)
      kfree(bfqg)
  blkg->pd[pol->plid] = NULL
  spin_unlock_irq(q->queue_lock);
                                   bfq_group_set_weight(bfqg, val, 0)
                                     bfqg->entity.new_weight
                                     ^^^^^^trigger uaf here
                                   spin_unlock_irq(&blkcg->lock);

Fix by grabbing the matching blkcg lock before trying to
destroy blkg policy data.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Li Jinlin <lijinlin3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210914042605.3260596-1-lijinlin3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblkcg: fix memory leak in blk_iolatency_init
Yanfei Xu [Wed, 15 Sep 2021 07:24:26 +0000 (15:24 +0800)]
blkcg: fix memory leak in blk_iolatency_init

BUG: memory leak
unreferenced object 0xffff888129acdb80 (size 96):
  comm "syz-executor.1", pid 12661, jiffies 4294962682 (age 15.220s)
  hex dump (first 32 bytes):
    20 47 c9 85 ff ff ff ff 20 d4 8e 29 81 88 ff ff   G...... ..)....
    01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff82264ec8>] kmalloc include/linux/slab.h:591 [inline]
    [<ffffffff82264ec8>] kzalloc include/linux/slab.h:721 [inline]
    [<ffffffff82264ec8>] blk_iolatency_init+0x28/0x190 block/blk-iolatency.c:724
    [<ffffffff8225b8c4>] blkcg_init_queue+0xb4/0x1c0 block/blk-cgroup.c:1185
    [<ffffffff822253da>] blk_alloc_queue+0x22a/0x2e0 block/blk-core.c:566
    [<ffffffff8223b175>] blk_mq_init_queue_data block/blk-mq.c:3100 [inline]
    [<ffffffff8223b175>] __blk_mq_alloc_disk+0x25/0xd0 block/blk-mq.c:3124
    [<ffffffff826a9303>] loop_add+0x1c3/0x360 drivers/block/loop.c:2344
    [<ffffffff826a966e>] loop_control_get_free drivers/block/loop.c:2501 [inline]
    [<ffffffff826a966e>] loop_control_ioctl+0x17e/0x2e0 drivers/block/loop.c:2516
    [<ffffffff81597eec>] vfs_ioctl fs/ioctl.c:51 [inline]
    [<ffffffff81597eec>] __do_sys_ioctl fs/ioctl.c:874 [inline]
    [<ffffffff81597eec>] __se_sys_ioctl fs/ioctl.c:860 [inline]
    [<ffffffff81597eec>] __x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:860
    [<ffffffff843fa745>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    [<ffffffff843fa745>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
    [<ffffffff84600068>] entry_SYSCALL_64_after_hwframe+0x44/0xae

Once blk_throtl_init() queue init failed, blkcg_iolatency_exit() will
not be invoked for cleanup. That leads a memory leak. Swap the
blk_throtl_init() and blk_iolatency_init() calls can solve this.

Reported-by: syzbot+01321b15cc98e6bf96d6@syzkaller.appspotmail.com
Fixes: 19688d7f9592 (block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls)
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210915072426.4022924-1-yanfei.xu@windriver.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoMerge tag 'nvme-5.15-2021-09-15' of git://git.infradead.org/nvme into block-5.15
Jens Axboe [Wed, 15 Sep 2021 13:53:32 +0000 (07:53 -0600)]
Merge tag 'nvme-5.15-2021-09-15' of git://git.infradead.org/nvme into block-5.15

Pull NVMe fixes from Christoph:

"nvme fixes for Linux 5.15

 - fix ANA state updates when a namespace is not present (Anton Eidelman)
 - nvmet: fix a width vs precision bug in nvmet_subsys_attr_serial_show
   (Dan Carpenter)
 - avoid race in shutdown namespace removal (Daniel Wagner)
 - fix io_work priority inversion in nvme-tcp (Keith Busch)
 - destroy cm id before destroy qp to avoid use after free (Ruozhu Li)"

* tag 'nvme-5.15-2021-09-15' of git://git.infradead.org/nvme:
  nvme-tcp: fix io_work priority inversion
  nvme-rdma: destroy cm id before destroy qp to avoid use after free
  nvme-multipath: fix ANA state updates when a namespace is not present
  nvme: avoid race in shutdown namespace removal
  nvmet: fix a width vs precision bug in nvmet_subsys_attr_serial_show()

3 years agonvme: remove the call to nvme_update_disk_info in nvme_ns_remove
Christoph Hellwig [Tue, 14 Sep 2021 07:06:57 +0000 (09:06 +0200)]
nvme: remove the call to nvme_update_disk_info in nvme_ns_remove

There is no need to explicitly unregister the integrity profile when
deleting the gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20210914070657.87677-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: flush the integrity workqueue in blk_integrity_unregister
Lihong Kou [Tue, 14 Sep 2021 07:06:56 +0000 (09:06 +0200)]
block: flush the integrity workqueue in blk_integrity_unregister

When the integrity profile is unregistered there can still be integrity
reads queued up which could see a NULL verify_fn as shown by the race
window below:

CPU0                                    CPU1
  process_one_work                      nvme_validate_ns
    bio_integrity_verify_fn                nvme_update_ns_info
                                     nvme_update_disk_info
                                       blk_integrity_unregister
                                               ---set queue->integrity as 0
bio_integrity_process
--access bi->profile->verify_fn(bi is a pointer of queue->integity)

Before calling blk_integrity_unregister in nvme_update_disk_info, we must
make sure that there is no work item in the kintegrityd_wq. Just call
blk_flush_integrity to flush the work queue so the bug can be resolved.

Signed-off-by: Lihong Kou <koulihong@huawei.com>
[hch: split up and shortened the changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20210914070657.87677-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: check if a profile is actually registered in blk_integrity_unregister
Christoph Hellwig [Tue, 14 Sep 2021 07:06:55 +0000 (09:06 +0200)]
block: check if a profile is actually registered in blk_integrity_unregister

While clearing the profile itself is harmless, we really should not clear
the stable writes flag if it wasn't set due to a registered integrity
profile.

Reported-by: Lihong Kou <koulihong@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20210914070657.87677-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agonvme-tcp: fix io_work priority inversion
Keith Busch [Thu, 9 Sep 2021 15:54:52 +0000 (08:54 -0700)]
nvme-tcp: fix io_work priority inversion

Dispatching requests inline with the .queue_rq() call may block while
holding the send_mutex. If the tcp io_work also happens to schedule, it
may see the req_list is non-empty, leaving "pending" true and remaining
in TASK_RUNNING. Since io_work is of higher scheduling priority, the
.queue_rq task may not get a chance to run, blocking forward progress
and leading to io timeouts.

Instead of checking for pending requests within io_work, let the queueing
restart io_work outside the send_mutex lock if there is more work to be
done.

Fixes: a0fdd1418007f ("nvme-tcp: rerun io_work if req_list is not empty")
Reported-by: Samuel Jones <sjones@kalrayinc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme-rdma: destroy cm id before destroy qp to avoid use after free
Ruozhu Li [Mon, 6 Sep 2021 03:51:34 +0000 (11:51 +0800)]
nvme-rdma: destroy cm id before destroy qp to avoid use after free

We should always destroy cm_id before destroy qp to avoid to get cma
event after qp was destroyed, which may lead to use after free.
In RDMA connection establishment error flow, don't destroy qp in cm
event handler.Just report cm_error to upper level, qp will be destroy
in nvme_rdma_alloc_queue() after destroy cm id.

Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme-multipath: fix ANA state updates when a namespace is not present
Anton Eidelman [Sun, 12 Sep 2021 18:54:57 +0000 (12:54 -0600)]
nvme-multipath: fix ANA state updates when a namespace is not present

nvme_update_ana_state() has a deficiency that results in a failure to
properly update the ana state for a namespace in the following case:

  NSIDs in ctrl->namespaces: 1, 3,    4
  NSIDs in desc->nsids: 1, 2, 3, 4

Loop iteration 0:
    ns index = 0, n = 0, ns->head->ns_id = 1, nsid = 1, MATCH.
Loop iteration 1:
    ns index = 1, n = 1, ns->head->ns_id = 3, nsid = 2, NO MATCH.
Loop iteration 2:
    ns index = 2, n = 2, ns->head->ns_id = 4, nsid = 4, MATCH.

Where the update to the ANA state of NSID 3 is missed.  To fix this
increment n and retry the update with the same ns when ns->head->ns_id is
higher than nsid,

Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
3 years agonvme: avoid race in shutdown namespace removal
Daniel Wagner [Thu, 2 Sep 2021 09:20:02 +0000 (11:20 +0200)]
nvme: avoid race in shutdown namespace removal

When we remove the siblings entry, we update ns->head->list, hence we
can't separate the removal and test for being empty. They have to be
in the same critical section to avoid a race.

To avoid breaking the refcounting imbalance again, add a list empty
check to nvme_find_ns_head.

Fixes: 5396fdac56d8 ("nvme: fix refcounting imbalance when all paths are down")
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet: fix a width vs precision bug in nvmet_subsys_attr_serial_show()
Dan Carpenter [Thu, 9 Sep 2021 09:14:40 +0000 (12:14 +0300)]
nvmet: fix a width vs precision bug in nvmet_subsys_attr_serial_show()

This was intended to limit the number of characters printed from
"subsys->serial" to NVMET_SN_MAX_SIZE.  But accidentally the width
specifier was used instead of the precision specifier so it only
affects the alignment and not the number of characters printed.

Fixes: f04064814c2a ("nvmet: fixup buffer overrun in nvmet_subsys_attr_serial()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agoblk-mq: avoid to iterate over stale request
Ming Lei [Mon, 6 Sep 2021 06:50:03 +0000 (14:50 +0800)]
blk-mq: avoid to iterate over stale request

blk-mq can't run allocating driver tag and updating ->rqs[tag]
atomically, meantime blk-mq doesn't clear ->rqs[tag] after the driver
tag is released.

So there is chance to iterating over one stale request just after the
tag is allocated and before updating ->rqs[tag].

scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi
in-flight requests after scsi host is blocked, so no new scsi command can
be marked as SCMD_STATE_INFLIGHT. However, driver tag allocation still can
be run by blk-mq core. One request is marked as SCMD_STATE_INFLIGHT,
but this request may have been kept in another slot of ->rqs[], meantime
the slot can be allocated out but ->rqs[] isn't updated yet. Then this
in-flight request is counted twice as SCMD_STATE_INFLIGHT. This way causes
trouble in handling scsi error.

Fixes the issue by not iterating over stale request.

Cc: linux-scsi@vger.kernel.org
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Reported-by: luojiaxing <luojiaxing@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210906065003.439019-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoLinux 5.15-rc1
Linus Torvalds [Sun, 12 Sep 2021 23:28:37 +0000 (16:28 -0700)]
Linux 5.15-rc1

3 years agoMerge tag 'perf-tools-for-v5.15-2021-09-11' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Sun, 12 Sep 2021 23:18:15 +0000 (16:18 -0700)]
Merge tag 'perf-tools-for-v5.15-2021-09-11' of git://git./linux/kernel/git/acme/linux

Pull more perf tools updates from Arnaldo Carvalho de Melo:

 - Add missing fields and remove some duplicate fields when printing a
   perf_event_attr.

 - Fix hybrid config terms list corruption.

 - Update kernel header copies, some resulted in new kernel features
   being automagically added to 'perf trace' syscall/tracepoint argument
   id->string translators.

 - Add a file generated during the documentation build to .gitignore.

 - Add an option to build without libbfd, as some distros, like Debian
   consider its ABI unstable.

 - Add support to print a textual representation of IBS raw sample data
   in 'perf report'.

 - Fix bpf 'perf test' sample mismatch reporting

 - Fix passing arguments to stackcollapse report in a 'perf script'
   python script.

 - Allow build-id with trailing zeros.

 - Look for ImageBase in PE file to compute .text offset.

* tag 'perf-tools-for-v5.15-2021-09-11' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (25 commits)
  tools headers UAPI: Update tools's copy of drm.h headers
  tools headers UAPI: Sync drm/i915_drm.h with the kernel sources
  tools headers UAPI: Sync linux/fs.h with the kernel sources
  tools headers UAPI: Sync linux/in.h copy with the kernel sources
  perf tools: Add an option to build without libbfd
  perf tools: Allow build-id with trailing zeros
  perf tools: Fix hybrid config terms list corruption
  perf tools: Factor out copy_config_terms() and free_config_terms()
  perf tools: Fix perf_event_attr__fprintf() missing/dupl. fields
  perf tools: Ignore Documentation dependency file
  perf bpf: Provide a weak btf__load_from_kernel_by_id() for older libbpf versions
  tools include UAPI: Update linux/mount.h copy
  perf beauty: Cover more flags in the  move_mount syscall argument beautifier
  tools headers UAPI: Sync linux/prctl.h with the kernel sources
  tools include UAPI: Sync sound/asound.h copy with the kernel sources
  tools headers UAPI: Sync linux/kvm.h with the kernel sources
  tools headers UAPI: Sync x86's asm/kvm.h with the kernel sources
  perf report: Add support to print a textual representation of IBS raw sample data
  perf report: Add tools/arch/x86/include/asm/amd-ibs.h
  perf env: Add perf_env__cpuid, perf_env__{nr_}pmu_mappings
  ...

3 years agoMerge tag 'compiler-attributes-for-linus-v5.15-rc1-v2' of git://github.com/ojeda...
Linus Torvalds [Sun, 12 Sep 2021 23:09:26 +0000 (16:09 -0700)]
Merge tag 'compiler-attributes-for-linus-v5.15-rc1-v2' of git://github.com/ojeda/linux

Pull compiler attributes updates from Miguel Ojeda:

 - Fix __has_attribute(__no_sanitize_coverage__) for GCC 4 (Marco Elver)

 - Add Nick as Reviewer for compiler_attributes.h (Nick Desaulniers)

 - Move __compiletime_{error|warning} (Nick Desaulniers)

* tag 'compiler-attributes-for-linus-v5.15-rc1-v2' of git://github.com/ojeda/linux:
  compiler_attributes.h: move __compiletime_{error|warning}
  MAINTAINERS: add Nick as Reviewer for compiler_attributes.h
  Compiler Attributes: fix __has_attribute(__no_sanitize_coverage__) for GCC 4

3 years agoMerge tag 'auxdisplay-for-linus-v5.15-rc1' of git://github.com/ojeda/linux
Linus Torvalds [Sun, 12 Sep 2021 23:00:49 +0000 (16:00 -0700)]
Merge tag 'auxdisplay-for-linus-v5.15-rc1' of git://github.com/ojeda/linux

Pull auxdisplay updates from Miguel Ojeda:
 "An assortment of improvements for auxdisplay:

   - Replace symbolic permissions with octal permissions (Jinchao Wang)

   - ks0108: Switch to use module_parport_driver() (Andy Shevchenko)

   - charlcd: Drop unneeded initializers and switch to C99 style (Andy
     Shevchenko)

   - hd44780: Fix oops on module unloading (Lars Poeschel)

   - Add I2C gpio expander example (Ralf Schlatterbeck)"

* tag 'auxdisplay-for-linus-v5.15-rc1' of git://github.com/ojeda/linux:
  auxdisplay: Replace symbolic permissions with octal permissions
  auxdisplay: ks0108: Switch to use module_parport_driver()
  auxdisplay: charlcd: Drop unneeded initializers and switch to C99 style
  auxdisplay: hd44780: Fix oops on module unloading
  auxdisplay: Add I2C gpio expander example

3 years agoMerge tag 'smp-urgent-2021-09-12' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 12 Sep 2021 19:42:51 +0000 (12:42 -0700)]
Merge tag 'smp-urgent-2021-09-12' of git://git./linux/kernel/git/tip/tip

Pull CPU hotplug updates from Thomas Gleixner:
 "Updates for the SMP and CPU hotplug:

   - Remove DEFINE_SMP_CALL_CACHE_FUNCTION() which is a left over of the
     original hotplug code and now causing trouble with the ARM64 cache
     topology setup due to the pointless SMP function call.

     It's not longer required as the hotplug callbacks are guaranteed to
     be invoked on the upcoming CPU.

   - Remove the deprecated and now unused CPU hotplug functions

   - Rewrite the CPU hotplug API documentation"

* tag 'smp-urgent-2021-09-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Documentation: core-api/cpuhotplug: Rewrite the API section
  cpu/hotplug: Remove deprecated CPU-hotplug functions.
  thermal: Replace deprecated CPU-hotplug functions.
  drivers: base: cacheinfo: Get rid of DEFINE_SMP_CALL_CACHE_FUNCTION()

3 years agoMerge tag 'char-misc-5.15-rc1-lkdtm' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 12 Sep 2021 18:56:00 +0000 (11:56 -0700)]
Merge tag 'char-misc-5.15-rc1-lkdtm' of git://git./linux/kernel/git/gregkh/char-misc

Pull misc driver fix from Greg KH:
 "Here is a single patch for 5.15-rc1, for the lkdtm misc driver.

  It resolves a build issue that many people were hitting with your
  current tree, and Kees and others felt would be good to get merged
  before -rc1 comes out, to prevent them from having to constantly hit
  it as many development trees restart on -rc1, not older -rc releases.

  It has NOT been in linux-next, but has passed 0-day testing and looks
  'obviously correct' when reviewing it locally :)"

* tag 'char-misc-5.15-rc1-lkdtm' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  lkdtm: Use init_uts_ns.name instead of macros

3 years agoMerge tag 'for-linus-5.15-1' of git://github.com/cminyard/linux-ipmi
Linus Torvalds [Sun, 12 Sep 2021 18:44:58 +0000 (11:44 -0700)]
Merge tag 'for-linus-5.15-1' of git://github.com/cminyard/linux-ipmi

Pull IPMI updates from Corey Minyard:
 "A couple of very minor fixes for style and rate limiting.

  Nothing big, but probably needs to go in"

* tag 'for-linus-5.15-1' of git://github.com/cminyard/linux-ipmi:
  char: ipmi: use DEVICE_ATTR helper macro
  ipmi: rate limit ipmi smi_event failure message

3 years agoMerge tag 'sched_urgent_for_v5.15_rc1' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 12 Sep 2021 18:37:41 +0000 (11:37 -0700)]
Merge tag 'sched_urgent_for_v5.15_rc1' of git://git./linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

 - Make sure the idle timer expires in hardirq context, on PREEMPT_RT

 - Make sure the run-queue balance callback is invoked only on the
   outgoing CPU

* tag 'sched_urgent_for_v5.15_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Prevent balance_push() on remote runqueues
  sched/idle: Make the idle timer expire in hard interrupt context

3 years agoMerge tag 'locking_urgent_for_v5.15_rc1' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 12 Sep 2021 18:27:05 +0000 (11:27 -0700)]
Merge tag 'locking_urgent_for_v5.15_rc1' of git://git./linux/kernel/git/tip/tip

Pull locking fixes from Borislav Petkov:

 - Fix the futex PI requeue machinery to not return to userspace in
   inconsistent state

 - Avoid a potential null pointer dereference in the ww_mutex deadlock
   check

 - Other smaller cleanups and optimizations

* tag 'locking_urgent_for_v5.15_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rtmutex: Fix ww_mutex deadlock check
  futex: Remove unused variable 'vpid' in futex_proxy_trylock_atomic()
  futex: Avoid redundant task lookup
  futex: Clarify comment for requeue_pi_wake_futex()
  futex: Prevent inconsistent state and exit race
  futex: Return error code instead of assigning it without effect
  locking/rwsem: Add missing __init_rwsem() for PREEMPT_RT

3 years agoMerge tag 'timers_urgent_for_v5.15_rc1' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 12 Sep 2021 18:10:31 +0000 (11:10 -0700)]
Merge tag 'timers_urgent_for_v5.15_rc1' of git://git./linux/kernel/git/tip/tip

Pull timer fix from Borislav Petkov:

 - Handle negative second values properly when converting a timespec64
   to nanoseconds.

* tag 'timers_urgent_for_v5.15_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time: Handle negative seconds correctly in timespec64_to_ns()

3 years agoMerge branch 'misc.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds [Sun, 12 Sep 2021 17:43:51 +0000 (10:43 -0700)]
Merge branch 'misc.namei' of git://git./linux/kernel/git/viro/vfs

Pull namei updates from Al Viro:
 "Clearing fallout from mkdirat in io_uring series. The fix in the
  kern_path_locked() patch plus associated cleanups"

* 'misc.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  putname(): IS_ERR_OR_NULL() is wrong here
  namei: Standardize callers of filename_create()
  namei: Standardize callers of filename_lookup()
  rename __filename_parentat() to filename_parentat()
  namei: Fix use after free in kern_path_locked

3 years agoMerge tag '5.15-rc-cifs-part2' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Sun, 12 Sep 2021 17:10:21 +0000 (10:10 -0700)]
Merge tag '5.15-rc-cifs-part2' of git://git.samba.org/sfrench/cifs-2.6

Pull smbfs updates from Steve French:
 "cifs/smb3 updates:

   - DFS reconnect fix

   - begin creating common headers for server and client

   - rename the cifs_common directory to smbfs_common to be more
     consistent ie change use of the name cifs to smb (smb3 or smbfs is
     more accurate, as the very old cifs dialect has long been
     superseded by smb3 dialects).

  In the future we can rename the fs/cifs directory to fs/smbfs.

  This does not include the set of multichannel fixes nor the two
  deferred close fixes (they are still being reviewed and tested)"

* tag '5.15-rc-cifs-part2' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: properly invalidate cached root handle when closing it
  cifs: move SMB FSCTL definitions to common code
  cifs: rename cifs_common to smbfs_common
  cifs: update FSCTL definitions

3 years agoMerge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Linus Torvalds [Sat, 11 Sep 2021 21:48:42 +0000 (14:48 -0700)]
Merge tag 'for_linus' of git://git./linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:

 - vduse driver ("vDPA Device in Userspace") supporting emulated virtio
   block devices

 - virtio-vsock support for end of record with SEQPACKET

 - vdpa: mac and mq support for ifcvf and mlx5

 - vdpa: management netlink for ifcvf

 - virtio-i2c, gpio dt bindings

 - misc fixes and cleanups

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (39 commits)
  Documentation: Add documentation for VDUSE
  vduse: Introduce VDUSE - vDPA Device in Userspace
  vduse: Implement an MMU-based software IOTLB
  vdpa: Support transferring virtual addressing during DMA mapping
  vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()
  vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
  vhost-iotlb: Add an opaque pointer for vhost IOTLB
  vhost-vdpa: Handle the failure of vdpa_reset()
  vdpa: Add reset callback in vdpa_config_ops
  vdpa: Fix some coding style issues
  file: Export receive_fd() to modules
  eventfd: Export eventfd_wake_count to modules
  iova: Export alloc_iova_fast() and free_iova_fast()
  virtio-blk: remove unneeded "likely" statements
  virtio-balloon: Use virtio_find_vqs() helper
  vdpa: Make use of PFN_PHYS/PFN_UP/PFN_DOWN helper macro
  vsock_test: update message bounds test for MSG_EOR
  af_vsock: rename variables in receive loop
  virtio/vsock: support MSG_EOR bit processing
  vhost/vsock: support MSG_EOR bit processing
  ...

3 years agoMerge tag 'riscv-for-linus-5.15-mw1' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 11 Sep 2021 21:29:42 +0000 (14:29 -0700)]
Merge tag 'riscv-for-linus-5.15-mw1' of git://git./linux/kernel/git/riscv/linux

Pull more RISC-V updates from Palmer Dabbelt:

 - A pair of defconfig additions, for NVMe and the EFI filesystem
   localization options.

 - A larger address space for stack randomization.

 - A cleanup to our install rules.

 - A DTS update for the Microchip Icicle board, to fix the serial
   console.

 - Support for build-time table sorting, which allows us to have
   __ex_table read-only.

* tag 'riscv-for-linus-5.15-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: Move EXCEPTION_TABLE to RO_DATA segment
  riscv: Enable BUILDTIME_TABLE_SORT
  riscv: dts: microchip: mpfs-icicle: Fix serial console
  riscv: move the (z)install rules to arch/riscv/Makefile
  riscv: Improve stack randomisation on RV64
  riscv: defconfig: enable NLS_CODEPAGE_437, NLS_ISO8859_1
  riscv: defconfig: enable BLK_DEV_NVME

3 years agoMerge branch 'for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jlawall...
Linus Torvalds [Sat, 11 Sep 2021 21:22:28 +0000 (14:22 -0700)]
Merge branch 'for-5.15' of git://git./linux/kernel/git/jlawall/linux

Pull coccinelle updates from Julia Lawall:
 "These changes update some existing semantic patches with
  respect to some recent changes in the kernel.

  Specifically, the change to kvmalloc.cocci searches for
  kfree_sensitive rather than kzfree, and the change to
  use_after_iter.cocci adds list_entry_is_head as a valid
  use of a list iterator index variable after the end of
  the loop"

* 'for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jlawall/linux:
  scripts: coccinelle: allow list_entry_is_head() to use pos
  coccinelle: api: rename kzfree to kfree_sensitive

3 years agotools headers UAPI: Update tools's copy of drm.h headers
Arnaldo Carvalho de Melo [Mon, 3 May 2021 14:48:26 +0000 (11:48 -0300)]
tools headers UAPI: Update tools's copy of drm.h headers

Picking the changes from:

  17ce9c61c71cbc0d ("drm: document DRM_IOCTL_MODE_RMFB")

Doesn't result in any tooling changes:

  $ tools/perf/trace/beauty/drm_ioctl.sh  > before
  $ cp include/uapi/drm/drm.h tools/include/uapi/drm/drm.h
  $ tools/perf/trace/beauty/drm_ioctl.sh  > after
  $ diff -u before after

Silencing these perf build warnings:

  Warning: Kernel ABI header at 'tools/include/uapi/drm/drm.h' differs from latest version at 'include/uapi/drm/drm.h'
  diff -u tools/include/uapi/drm/drm.h include/uapi/drm/drm.h

Cc: Simon Ser <contact@emersion.fr>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
3 years agotools headers UAPI: Sync drm/i915_drm.h with the kernel sources
Arnaldo Carvalho de Melo [Sat, 11 Sep 2021 19:18:28 +0000 (16:18 -0300)]
tools headers UAPI: Sync drm/i915_drm.h with the kernel sources

To pick the changes in:

  b65a9489730a2494 ("drm/i915/userptr: Probe existence of backing struct pages upon creation")
  ee242ca704d38699 ("drm/i915/guc: Implement GuC priority management")
  81340cf3bddded4f ("drm/i915/uapi: reject set_domain for discrete")
  7961c5b60f23dff5 ("drm/i915: Add TTM offset argument to mmap.")
  aef7b67a79564f6c ("drm/i915/uapi: convert drm_i915_gem_userptr to kernel doc")
  e7737b67ab46ee0e ("drm/i915/uapi: reject caching ioctls for discrete")
  3aa8c57fe25a9247 ("drm/i915/uapi: convert drm_i915_gem_set_domain to kernel doc")
  289f5a72009b8f67 ("drm/i915/uapi: convert drm_i915_gem_caching to kernel doc")
  4a766ae40ec83301 ("drm/i915: Drop the CONTEXT_CLONE API (v2)")
  6ff6d61dd2a943bd ("drm/i915: Drop I915_CONTEXT_PARAM_NO_ZEROMAP")
  fe4751c3d513ff4f ("drm/i915: Drop I915_CONTEXT_PARAM_RINGSIZE")
  577729533cdc4e37 ("drm/i915: Document the Virtual Engine uAPI")
  c649432e86ca677d ("drm/i915: Fix busy ioctl commentary")

That doesn't result in any changes to tooling as no new ioctl were
added (at least not perceived by tools/perf/trace/beauty/drm_ioctl.sh).

Addressing this perf build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/drm/i915_drm.h' differs from latest version at 'include/uapi/drm/i915_drm.h'
  diff -u tools/include/uapi/drm/i915_drm.h include/uapi/drm/i915_drm.h

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Jason Ekstrand <jason@jlekstrand.net>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
3 years agotools headers UAPI: Sync linux/fs.h with the kernel sources
Arnaldo Carvalho de Melo [Fri, 21 May 2021 19:00:31 +0000 (16:00 -0300)]
tools headers UAPI: Sync linux/fs.h with the kernel sources

To pick the change in:

  7957d93bf32bc211 ("block: add ioctl to read the disk sequence number")

It adds a new ioctl, but we are still not using that to generate tables
for 'perf trace', so no changes in tooling.

This silences this perf build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/fs.h' differs from latest version at 'include/uapi/linux/fs.h'
  diff -u tools/include/uapi/linux/fs.h include/uapi/linux/fs.h

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
3 years agotools headers UAPI: Sync linux/in.h copy with the kernel sources
Arnaldo Carvalho de Melo [Sat, 19 Jun 2021 13:15:22 +0000 (10:15 -0300)]
tools headers UAPI: Sync linux/in.h copy with the kernel sources

To pick the changes in:

  db243b796439c0ca ("net/ipv4/ipv6: Replace one-element arraya with flexible-array members")
  2d3e5caf96b9449a ("net/ipv4: Replace one-element array with flexible-array member")

That don't result in any change in tooling, the structs changed remains
with the same layout.

This addresses this build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/in.h' differs from latest version at 'include/uapi/linux/in.h'
  diff -u tools/include/uapi/linux/in.h include/uapi/linux/in.h

Cc: David S. Miller <davem@davemloft.net>
Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
3 years agoperf tools: Add an option to build without libbfd
Ian Rogers [Fri, 10 Sep 2021 22:57:56 +0000 (15:57 -0700)]
perf tools: Add an option to build without libbfd

Some distributions, like debian, don't link perf with libbfd. Add a
build flag to make this configuration buildable and testable.

This was inspired by:

  https://lore.kernel.org/linux-perf-users/20210910102307.2055484-1-tonyg@leastfixedpoint.com/T/#u

Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: tony garnock-jones <tonyg@leastfixedpoint.com>
Link: http://lore.kernel.org/lkml/20210910225756.729087-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
3 years agoperf tools: Allow build-id with trailing zeros
Namhyung Kim [Fri, 10 Sep 2021 22:46:30 +0000 (15:46 -0700)]
perf tools: Allow build-id with trailing zeros

Currently perf saves a build-id with size but old versions assumes the
size of 20.  In case the build-id is less than 20 (like for MD5), it'd
fill the rest with 0s.

I saw a problem when old version of perf record saved a binary in the
build-id cache and new version of perf reads the data.  The symbols
should be read from the build-id cache (as the path no longer has the
same binary) but it failed due to mismatch in the build-id.

  symsrc__init: build id mismatch for /home/namhyung/.debug/.build-id/53/e4c2f42a4c61a2d632d92a72afa08f00000000/elf.

The build-id event in the data has 20 byte build-ids, but it saw a
different size (16) when it reads the build-id of the elf file in the
build-id cache.

  $ readelf -n ~/.debug/.build-id/53/e4c2f42a4c61a2d632d92a72afa08f00000000/elf

  Displaying notes found in: .note.gnu.build-id
    Owner                Data size  Description
    GNU                  0x00000010 NT_GNU_BUILD_ID (unique build ID bitstring)
      Build ID: 53e4c2f42a4c61a2d632d92a72afa08f

Let's fix this by allowing trailing zeros if the size is different.

Fixes: 39be8d0115b321ed ("perf tools: Pass build_id object to dso__build_id_equal()")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20210910224630.1084877-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
3 years agoperf tools: Fix hybrid config terms list corruption
Adrian Hunter [Thu, 9 Sep 2021 12:55:08 +0000 (15:55 +0300)]
perf tools: Fix hybrid config terms list corruption

A config terms list was spliced twice, resulting in a never-ending loop
when the list was traversed. Fix by using list_splice_init() and copying
and freeing the lists as necessary.

This patch also depends on patch "perf tools: Factor out
copy_config_terms() and free_config_terms()"

Example on ADL:

 Before:

  # perf record -e '{intel_pt//,cycles/aux-sample-size=4096/pp}' uname &
  # jobs
  [1]+  Running                    perf record -e "{intel_pt//,cycles/aux-sample-size=4096/pp}" uname
  # perf top -E 10
    PerfTop:    4071 irqs/sec  kernel: 6.9%  exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 24 CPUs)
  ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    97.60%  perf           [.] __evsel__get_config_term
     0.25%  [kernel]       [k] kallsyms_expand_symbol.constprop.13
     0.24%  perf           [.] kallsyms__parse
     0.15%  [kernel]       [k] _raw_spin_lock
     0.14%  [kernel]       [k] number
     0.13%  [kernel]       [k] advance_transaction
     0.08%  [kernel]       [k] format_decode
     0.08%  perf           [.] map__process_kallsym_symbol
     0.08%  perf           [.] rb_insert_color
     0.08%  [kernel]       [k] vsnprintf
  exiting.
  # kill %1

After:

  # perf record -e '{intel_pt//,cycles/aux-sample-size=4096/pp}' uname &
  Linux
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.060 MB perf.data ]
  # perf script | head
       perf-exec   604 [001]  1827.312293:                            psb:  psb offs: 0                       ffffffffb8415e87 pt_config_start+0x37 ([kernel.kallsyms])
       perf-exec   604  1827.312293:          1                       branches:  ffffffffb856a3bd event_sched_in.isra.133+0xfd ([kernel.kallsyms]) => ffffffffb856a9a0 perf_pmu_nop_void+0x0 ([kernel.kallsyms])
       perf-exec   604  1827.312293:          1                       branches:  ffffffffb856b10e merge_sched_in+0x26e ([kernel.kallsyms]) => ffffffffb856a2c0 event_sched_in.isra.133+0x0 ([kernel.kallsyms])
       perf-exec   604  1827.312293:          1                       branches:  ffffffffb856a45d event_sched_in.isra.133+0x19d ([kernel.kallsyms]) => ffffffffb8568b80 perf_event_set_state.part.61+0x0 ([kernel.kallsyms])
       perf-exec   604  1827.312293:          1                       branches:  ffffffffb8568b86 perf_event_set_state.part.61+0x6 ([kernel.kallsyms]) => ffffffffb85662a0 perf_event_update_time+0x0 ([kernel.kallsyms])
       perf-exec   604  1827.312293:          1                       branches:  ffffffffb856a35c event_sched_in.isra.133+0x9c ([kernel.kallsyms]) => ffffffffb8567610 perf_log_itrace_start+0x0 ([kernel.kallsyms])
       perf-exec   604  1827.312293:          1                       branches:  ffffffffb856a377 event_sched_in.isra.133+0xb7 ([kernel.kallsyms]) => ffffffffb8403b40 x86_pmu_add+0x0 ([kernel.kallsyms])
       perf-exec   604  1827.312293:          1                       branches:  ffffffffb8403b86 x86_pmu_add+0x46 ([kernel.kallsyms]) => ffffffffb8403940 collect_events+0x0 ([kernel.kallsyms])
       perf-exec   604  1827.312293:          1                       branches:  ffffffffb8403a7b collect_events+0x13b ([kernel.kallsyms]) => ffffffffb8402cd0 collect_event+0x0 ([kernel.kallsyms])

Fixes: 30def61f64bac5 ("perf parse-events Create two hybrid cache events")
Fixes: 94da591b1c7913 ("perf parse-events Create two hybrid raw events")
Fixes: 9cbfa2f64c04d9 ("perf parse-events Create two hybrid hardware events")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Link: https //lore.kernel.org/r/20210909125508.28693-3-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
3 years agoperf tools: Factor out copy_config_terms() and free_config_terms()
Adrian Hunter [Thu, 9 Sep 2021 12:55:07 +0000 (15:55 +0300)]
perf tools: Factor out copy_config_terms() and free_config_terms()

Factor out copy_config_terms() and free_config_terms() so that they can
be reused.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Link: https //lore.kernel.org/r/20210909125508.28693-2-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
3 years agoperf tools: Fix perf_event_attr__fprintf() missing/dupl. fields
Adrian Hunter [Sat, 11 Sep 2021 12:05:50 +0000 (15:05 +0300)]
perf tools: Fix perf_event_attr__fprintf() missing/dupl. fields

Some fields are missing and text_poke is duplicated. Fix that up.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lore.kernel.org/lkml/20210911120550.12203-1-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
3 years agoperf tools: Ignore Documentation dependency file
Ian Rogers [Fri, 10 Sep 2021 23:22:49 +0000 (16:22 -0700)]
perf tools: Ignore Documentation dependency file

When building directly on the checked out repository the build process
produces a file that should be ignored, so add it to .gitignore.

Fixes: a81df63a5df3e195 ("perf doc: Fix doc.dep")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20210910232249.739661-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
3 years agoMerge tag 'io_uring-5.15-2021-09-11' of git://git.kernel.dk/linux-block
Linus Torvalds [Sat, 11 Sep 2021 17:28:14 +0000 (10:28 -0700)]
Merge tag 'io_uring-5.15-2021-09-11' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

 - Fix an off-by-one in a BUILD_BUG_ON() check. Not a real issue right
   now as we have plenty of flags left, but could become one. (Hao)

 - Fix lockdep issue introduced in this merge window (me)

 - Fix a few issues with the worker creation (me, Pavel, Qiang)

 - Fix regression with wq_has_sleeper() for IOPOLL (Pavel)

 - Timeout link error propagation fix (Pavel)

* tag 'io_uring-5.15-2021-09-11' of git://git.kernel.dk/linux-block:
  io_uring: fix off-by-one in BUILD_BUG_ON check of __REQ_F_LAST_BIT
  io_uring: fail links of cancelled timeouts
  io-wq: fix memory leak in create_io_worker()
  io-wq: fix silly logic error in io_task_work_match()
  io_uring: drop ctx->uring_lock before acquiring sqd->lock
  io_uring: fix missing mb() before waitqueue_active
  io-wq: fix cancellation on create-worker failure

3 years agoMerge tag 'block-5.15-2021-09-11' of git://git.kernel.dk/linux-block
Linus Torvalds [Sat, 11 Sep 2021 17:19:51 +0000 (10:19 -0700)]
Merge tag 'block-5.15-2021-09-11' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request from Christoph:
     - fix nvmet command set reporting for passthrough controllers (Adam Manzanares)
     - update a MAINTAINERS email address (Chaitanya Kulkarni)
     - set QUEUE_FLAG_NOWAIT for nvme-multipth (me)
     - handle errors from add_disk() (Luis Chamberlain)
     - update the keep alive interval when kato is modified (Tatsuya Sasaki)
     - fix a buffer overrun in nvmet_subsys_attr_serial (Hannes Reinecke)
     - do not reset transport on data digest errors in nvme-tcp (Daniel Wagner)
     - only call synchronize_srcu when clearing current path (Daniel Wagner)
     - revalidate paths during rescan (Hannes Reinecke)

 - Split out the fs/block_dev into block/fops.c and block/bdev.c, which
   has been long overdue. Do this now before -rc1, to avoid annoying
   conflicts due to this (Christoph)

 - blk-throtl use-after-free fix (Li)

 - Improve plug depth for multi-device plugs, greatly increasing md
   resync performance (Song)

 - blkdev_show() locking fix (Tetsuo)

 - n64cart error check fix (Yang)

* tag 'block-5.15-2021-09-11' of git://git.kernel.dk/linux-block:
  n64cart: fix return value check in n64cart_probe()
  blk-mq: allow 4x BLK_MAX_REQUEST_COUNT at blk_plug for multiple_queues
  block: move fs/block_dev.c to block/bdev.c
  block: split out operations on block special files
  blk-throttle: fix UAF by deleteing timer in blk_throtl_exit()
  block: genhd: don't call blkdev_show() with major_names_lock held
  nvme: update MAINTAINERS email address
  nvme: add error handling support for add_disk()
  nvme: only call synchronize_srcu when clearing current path
  nvme: update keep alive interval when kato is modified
  nvme-tcp: Do not reset transport on data digest errors
  nvmet: fixup buffer overrun in nvmet_subsys_attr_serial()
  nvmet: return bool from nvmet_passthru_ctrl and nvmet_is_passthru_req
  nvmet: looks at the passthrough controller when initializing CAP
  nvme: move nvme_multi_css into nvme.h
  nvme-multipath: revalidate paths during rescan
  nvme-multipath: set QUEUE_FLAG_NOWAIT

3 years agoMerge tag 'libata-5.15-2021-09-11' of git://git.kernel.dk/linux-block
Linus Torvalds [Sat, 11 Sep 2021 17:18:20 +0000 (10:18 -0700)]
Merge tag 'libata-5.15-2021-09-11' of git://git.kernel.dk/linux-block

Pull libata maintainer update from Jens Axboe:
 "Damien agreed to take over maintainership of libata, and he would be a
  great candidate for it. Update the MAINTAINERS entry to reflect the
  change in maintainer and git tree"

* tag 'libata-5.15-2021-09-11' of git://git.kernel.dk/linux-block:
  libata: pass over maintainership to Damien Le Moal

3 years agoMerge tag 'trace-v5.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt...
Linus Torvalds [Sat, 11 Sep 2021 17:16:30 +0000 (10:16 -0700)]
Merge tag 'trace-v5.15-3' of git://git./linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Minor fixes to the processing of the bootconfig tree"

* tag 'trace-v5.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  bootconfig: Rename xbc_node_find_child() to xbc_node_find_subkey()
  tracing/boot: Fix to check the histogram control param is a leaf node
  tracing/boot: Fix trace_boot_hist_add_array() to check array is value

3 years agoMerge tag 'devicetree-fixes-for-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 11 Sep 2021 17:12:46 +0000 (10:12 -0700)]
Merge tag 'devicetree-fixes-for-5.15-1' of git://git./linux/kernel/git/robh/linux

Pull devicetree fixes from Rob Herring:

 - Disable fw_devlinks on x86 DT platforms to fix OLPC

 - More replacing oneOf+const with enum on a few new schemas

 - Drop unnecessary type references on Xilinx SPI binding schema

* tag 'devicetree-fixes-for-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
  spi: dt-bindings: xilinx: Drop type reference on *-bits properties
  dt-bindings: More use 'enum' instead of 'oneOf' plus 'const' entries
  of: property: Disable fw_devlink DT support for X86

3 years agoMerge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Linus Torvalds [Sat, 11 Sep 2021 17:05:56 +0000 (10:05 -0700)]
Merge tag 'clk-for-linus' of git://git./linux/kernel/git/clk/linux

Pull clk fix from Stephen Boyd:
 "One patch to fix an unused variable warning in a Qualcomm clk driver"

* tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  clk: qcom: gcc-sm6350: Remove unused variable

3 years agoMerge tag 'rtc-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Linus Torvalds [Sat, 11 Sep 2021 16:54:53 +0000 (09:54 -0700)]
Merge tag 'rtc-5.15' of git://git./linux/kernel/git/abelloni/linux

Pull RTC updates from Alexandre Belloni:
 "The broken down time conversion is similar to what is done in the time
  subsystem since v5.14. The rest is fairly straightforward.

  Subsystem:
   - Switch to Neri and Schneider time conversion algorithm

  Drivers:
   - rx8025: add rx8035 support
   - s5m: modernize driver and set range"

* tag 'rtc-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
  rtc: rx8010: select REGMAP_I2C
  dt-bindings: rtc: add Epson RX-8025 and RX-8035
  rtc: rx8025: implement RX-8035 support
  rtc: cmos: remove stale REVISIT comments
  rtc: tps65910: Correct driver module alias
  rtc: move RTC_LIB_KUNIT_TEST to proper location
  rtc: lib_test: add MODULE_LICENSE
  rtc: Improve performance of rtc_time64_to_tm(). Add tests.
  rtc: s5m: set range
  rtc: s5m: enable wakeup only when available
  rtc: s5m: signal the core when alarm are not available
  rtc: s5m: switch to devm_rtc_allocate_device

3 years agoMerge tag 'firewire-update' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394...
Linus Torvalds [Sat, 11 Sep 2021 16:47:33 +0000 (09:47 -0700)]
Merge tag 'firewire-update' of git://git./linux/kernel/git/ieee1394/linux1394

Pull firewire updates from Stefan Richter:

 - Migrate the bus snooper driver 'nosy' from PCI to DMA API

 - Small janitorial cleanup in the IPv4/v6-over-1394 driver

[ The 'nosy' change already come in as a different commit through Greg
  KH in the misc tree back in the previous merge window, so only the
  cleanup ends up being new to 5.15   - Linus ]

* tag 'firewire-update' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
  firewire: nosy: switch from 'pci_' to 'dma_' API
  firewire: net: remove unused variable 'guid'

3 years agoMerge tag 'pwm/for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry...
Linus Torvalds [Sat, 11 Sep 2021 16:26:00 +0000 (09:26 -0700)]
Merge tag 'pwm/for-5.15-rc1' of git://git./linux/kernel/git/thierry.reding/linux-pwm

Pull pwm updates from Thierry Reding:
 "The changes this time around are mostly janitorial in nature. A lot of
  this is simplifications of drivers using device-managed functions and
  improving compilation coverage.

  The Mediatek display PWM driver now supports the atomic API.

  Cleanups and minor fixes make up the remainder of this set"

* tag 'pwm/for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm: (54 commits)
  pwm: mtk-disp: Implement atomic API .get_state()
  pwm: mtk-disp: Fix overflow in period and duty calculation
  pwm: mtk-disp: Implement atomic API .apply()
  pwm: mtk-disp: Adjust the clocks to avoid them mismatch
  dt-bindings: pwm: rockchip: Add description for rk3568
  pwm: Make pwmchip_remove() return void
  pwm: sun4i: Don't check the return code of pwmchip_remove()
  pwm: sifive: Don't check the return code of pwmchip_remove()
  pwm: samsung: Don't check the return code of pwmchip_remove()
  pwm: renesas-tpu: Don't check the return code of pwmchip_remove()
  pwm: rcar: Don't check the return code of pwmchip_remove()
  pwm: pca9685: Don't check the return code of pwmchip_remove()
  pwm: omap-dmtimer: Don't check the return code of pwmchip_remove()
  pwm: mtk-disp: Don't check the return code of pwmchip_remove()
  pwm: imx-tpm: Don't check the return code of pwmchip_remove()
  pwm: img: Don't check the return code of pwmchip_remove()
  pwm: cros-ec: Don't check the return code of pwmchip_remove()
  pwm: brcmstb: Don't check the return code of pwmchip_remove()
  pwm: atmel-tcb: Don't check the return code of pwmchip_remove()
  pwm: atmel-hlcdc: Don't check the return code of pwmchip_remove()
  ...

3 years agoMerge tag 'thermal-v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/therma...
Linus Torvalds [Sat, 11 Sep 2021 16:20:57 +0000 (09:20 -0700)]
Merge tag 'thermal-v5.15-rc1' of git://git./linux/kernel/git/thermal/linux

Pull thermal updates from Daniel Lezcano:

 - Add the tegra3 thermal sensor and fix the compilation testing on
   tegra by adding a dependency on ARCH_TEGRA along with COMPILE_TEST
   (Dmitry Osipenko)

 - Fix the error code for the exynos when devm_get_clk() fails (Dan
   Carpenter)

 - Add the TCC cooling support for AlderLake platform (Sumeet Pawnikar)

 - Add support for hardware trip points for the rcar gen3 thermal driver
   and store TSC id as unsigned int (Niklas Söderlund)

 - Replace the deprecated CPU-hotplug functions get_online_cpus() and
   put_online_cpus (Sebastian Andrzej Siewior)

 - Add the thermal tools directory in the MAINTAINERS file (Daniel
   Lezcano)

 - Fix the Makefile and the cross compilation flags for the userspace
   'tmon' tool (Rolf Eike Beer)

 - Allow to use the IMOK independently from the GDDV on Int340x (Sumeet
   Pawnikar)

 - Fix the stub thermal_cooling_device_register() function prototype
   which does not match the real function (Arnd Bergmann)

 - Make the thermal trip point optional in the DT bindings (Maxime
   Ripard)

 - Fix a typo in a comment in the core code (Geert Uytterhoeven)

 - Reduce the verbosity of the trace in the SoC thermal tegra driver
   (Dmitry Osipenko)

 - Add the support for the LMh (Limit Management hardware) driver on the
   QCom platforms (Thara Gopinath)

 - Allow processing of HWP interrupt by adding a weak function in the
   Intel driver (Srinivas Pandruvada)

 - Prevent an abort of the sensor probe is a channel is not used
   (Matthias Kaehlcke)

* tag 'thermal-v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux:
  thermal/drivers/qcom/spmi-adc-tm5: Don't abort probing if a sensor is not used
  thermal/drivers/intel: Allow processing of HWP interrupt
  dt-bindings: thermal: Add dt binding for QCOM LMh
  thermal/drivers/qcom: Add support for LMh driver
  firmware: qcom_scm: Introduce SCM calls to access LMh
  thermal/drivers/tegra-soctherm: Silence message about clamped temperature
  thermal: Spelling s/scallbacks/callbacks/
  dt-bindings: thermal: Make trips node optional
  thermal/core: Fix thermal_cooling_device_register() prototype
  thermal/drivers/int340x: Use IMOK independently
  tools/thermal/tmon: Add cross compiling support
  thermal/tools/tmon: Improve the Makefile
  MAINTAINERS: Add missing userspace thermal tools to the thermal section
  thermal/drivers/intel_powerclamp: Replace deprecated CPU-hotplug functions.
  thermal/drivers/rcar_gen3_thermal: Store TSC id as unsigned int
  thermal/drivers/rcar_gen3_thermal: Add support for hardware trip points
  drivers/thermal/intel: Add TCC cooling support for AlderLake platform
  thermal/drivers/exynos: Fix an error code in exynos_tmu_probe()
  thermal/drivers/tegra: Correct compile-testing of drivers
  thermal/drivers/tegra: Add driver for Tegra30 thermal sensor

3 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Linus Torvalds [Sat, 11 Sep 2021 16:08:28 +0000 (09:08 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input

Pull input updates from Dmitry Torokhov:

 - several device tree bindings for input devices have been converted to
   yaml

 - dropped no longer used ixp4xx-beeper and CSR Prima2 PWRC drivers

 - analog joystick has been converted to use ktime API and no longer
   warn about low resolution timers

 - a few driver fixes

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (24 commits)
  Input: analog - always use ktime functions
  Input: mms114 - support MMS134S
  Input: elan_i2c - reduce the resume time for controller in Whitebox
  Input: edt-ft5x06 - added case for EDT EP0110M09
  Input: adc-keys - drop bogus __refdata annotation
  Input: Fix spelling mistake in Kconfig "useable" -> "usable"
  Input: Fix spelling mistake in Kconfig "Modul" -> "Module"
  Input: remove dead CSR Prima2 PWRC driver
  Input: adp5589-keys - use the right header
  Input: adp5588-keys - use the right header
  dt-bindings: input: tsc2005: Convert to YAML schema
  Input: ep93xx_keypad - prepare clock before using it
  dt-bindings: input: sun4i-lradc: Add wakeup-source
  dt-bindings: input: Convert Regulator Haptic binding to a schema
  dt-bindings: input: Convert Pixcir Touchscreen binding to a schema
  dt-bindings: input: Convert ChipOne ICN8318 binding to a schema
  Input: pm8941-pwrkey - fix comma vs semicolon issue
  dt-bindings: power: reset: qcom-pon: Convert qcom PON binding to yaml
  dt-bindings: input: pm8941-pwrkey: Convert pm8941 power key binding to yaml
  dt-bindings: power: reset: Change 'additionalProperties' to true
  ...

3 years agoriscv: Move EXCEPTION_TABLE to RO_DATA segment
Jisheng Zhang [Thu, 26 Aug 2021 14:11:18 +0000 (22:11 +0800)]
riscv: Move EXCEPTION_TABLE to RO_DATA segment

_ex_table section is read-only, so move it to RO_DATA.

Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
3 years agoriscv: Enable BUILDTIME_TABLE_SORT
Jisheng Zhang [Thu, 26 Aug 2021 14:10:29 +0000 (22:10 +0800)]
riscv: Enable BUILDTIME_TABLE_SORT

Enable BUILDTIME_TABLE_SORT to sort the exception table at build time
rather than during boot.

Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
3 years agoriscv: dts: microchip: mpfs-icicle: Fix serial console
Geert Uytterhoeven [Thu, 26 Aug 2021 13:19:39 +0000 (15:19 +0200)]
riscv: dts: microchip: mpfs-icicle: Fix serial console

Currently, nothing is output on the serial console, unless
"console=ttyS0,115200n8" or "earlycon" are appended to the kernel
command line.  Enable automatic console selection using
chosen/stdout-path by adding a proper alias, and configure the expected
serial rate.

While at it, add aliases for the other three serial ports, which are
provided on the same micro-USB connector as the first one.

Fixes: 0fa6107eca4186ad ("RISC-V: Initial DTS for Microchip ICICLE board")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Bin Meng <bmeng.cn@gmail.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
3 years agoriscv: move the (z)install rules to arch/riscv/Makefile
Masahiro Yamada [Thu, 29 Jul 2021 14:21:47 +0000 (23:21 +0900)]
riscv: move the (z)install rules to arch/riscv/Makefile

Currently, the (z)install targets in arch/riscv/Makefile descend into
arch/riscv/boot/Makefile to invoke the shell script, but there is no
good reason to do so.

arch/riscv/Makefile can run the shell script directly.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
3 years agoriscv: Improve stack randomisation on RV64
Kefeng Wang [Thu, 12 Aug 2021 11:47:02 +0000 (19:47 +0800)]
riscv: Improve stack randomisation on RV64

This enlarges the bits availiable for stack randomisation on RV64 from
the default of 8MiB to 1GiB, to match arm64 and x86.

Also, update the documentation to reflect our support for stack
randomisation.

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
[Palmer: commit text]
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
3 years agoriscv: defconfig: enable NLS_CODEPAGE_437, NLS_ISO8859_1
Heinrich Schuchardt [Thu, 12 Aug 2021 08:10:27 +0000 (10:10 +0200)]
riscv: defconfig: enable NLS_CODEPAGE_437, NLS_ISO8859_1

The EFI system partition uses the FAT file system. Many distributions add
an entry in /etc/fstab for the ESP. We must ensure that mounting does not
fail.

The default code page for FAT is 437 (cf. CONFIG_FAT_DEFAULT_CODEPAGE).
The default IO character set is "iso8859-1" (cf. CONFIG_NLS_ISO8859_1).

So let's enable NLS_CODEPAGE_437 and NLS_ISO8859_1 in defconfig.

Signed-off-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
3 years agoriscv: defconfig: enable BLK_DEV_NVME
Heinrich Schuchardt [Thu, 12 Aug 2021 08:10:26 +0000 (10:10 +0200)]
riscv: defconfig: enable BLK_DEV_NVME

NVMe is a non-volatile storage media attached via PCIe.
As NVMe has much higher throughput than other block devices like
SATA it is a must have for RISC-V. Enable CONFIG_BLK_DEV_NVME.

The HiFive Unmatched is a board providing M.2 slots for NVMe drives.
Enable CONFIG_PCIE_FU740.

Signed-off-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
3 years agoDocumentation: core-api/cpuhotplug: Rewrite the API section
Thomas Gleixner [Thu, 9 Sep 2021 12:34:59 +0000 (14:34 +0200)]
Documentation: core-api/cpuhotplug: Rewrite the API section

Dave stumbled over the incomplete and confusing documentation of the CPU
hotplug API.

Rewrite it, add the missing function documentations and correct the
existing ones.

Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210909123212.489059409@linutronix.de
3 years agocpu/hotplug: Remove deprecated CPU-hotplug functions.
Sebastian Andrzej Siewior [Tue, 3 Aug 2021 14:16:21 +0000 (16:16 +0200)]
cpu/hotplug: Remove deprecated CPU-hotplug functions.

No users in tree use the deprecated CPU-hotplug functions anymore.

Remove them.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210803141621.780504-39-bigeasy@linutronix.de
3 years agothermal: Replace deprecated CPU-hotplug functions.
Sebastian Andrzej Siewior [Tue, 3 Aug 2021 14:16:02 +0000 (16:16 +0200)]
thermal: Replace deprecated CPU-hotplug functions.

The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().

Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210803141621.780504-20-bigeasy@linutronix.de
3 years agoMerge branch 'linus' into smp/urgent
Thomas Gleixner [Fri, 10 Sep 2021 22:38:47 +0000 (00:38 +0200)]
Merge branch 'linus' into smp/urgent

Ensure that all usage sites of get/put_online_cpus() except for the
struggler in drivers/thermal are gone. So the last user and the deprecated
inlines can be removed.