review.tizen.org Git - platform/kernel/linux-starfive.git/log

virtio-crypto: enable retry for virtio-crypto-dev

Enable retry for virtio-crypto-dev, so that crypto-engine
can process cipher-requests parallelly.

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Gonglei <arei.gonglei@huawei.com>
Reviewed-by: Gonglei <arei.gonglei@huawei.com>
Signed-off-by: lei he <helei.sig11@bytedance.com>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Message-Id: <20220506131627.180784-6-pizhenwei@bytedance.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

virtio-crypto: adjust dst_len at ops callback

For some akcipher operations(eg, decryption of pkcs1pad(rsa)),
the length of returned result maybe less than akcipher_req->dst_len,
we need to recalculate the actual dst_len through the virt-queue
protocol.

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Gonglei <arei.gonglei@huawei.com>
Reviewed-by: Gonglei <arei.gonglei@huawei.com>
Signed-off-by: lei he <helei.sig11@bytedance.com>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Message-Id: <20220506131627.180784-5-pizhenwei@bytedance.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

virtio-crypto: wait ctrl queue instead of busy polling

Originally, after submitting request into virtio crypto control
queue, the guest side polls the result from the virt queue. This
works like following:
    CPU0   CPU1               ...             CPUx  CPUy
     |      |                                  |     |
     \      \                                  /     /
      \--------spin_lock(&vcrypto->ctrl_lock)-------/
                           |
                 virtqueue add & kick
                           |
                  busy poll virtqueue
                           |
              spin_unlock(&vcrypto->ctrl_lock)
                          ...

There are two problems:
1, The queue depth is always 1, the performance of a virtio crypto
   device gets limited. Multi user processes share a single control
   queue, and hit spin lock race from control queue. Test on Intel
   Platinum 8260, a single worker gets ~35K/s create/close session
   operations, and 8 workers get ~40K/s operations with 800% CPU
   utilization.
2, The control request is supposed to get handled immediately, but
   in the current implementation of QEMU(v6.2), the vCPU thread kicks
   another thread to do this work, the latency also gets unstable.
   Tracking latency of virtio_crypto_alg_akcipher_close_session in 5s:
        usecs               : count     distribution
         0 -> 1          : 0        |                        |
         2 -> 3          : 7        |                        |
         4 -> 7          : 72       |                        |
         8 -> 15         : 186485   |************************|
        16 -> 31         : 687      |                        |
        32 -> 63         : 5        |                        |
        64 -> 127        : 3        |                        |
       128 -> 255        : 1        |                        |
       256 -> 511        : 0        |                        |
       512 -> 1023       : 0        |                        |
      1024 -> 2047       : 0        |                        |
      2048 -> 4095       : 0        |                        |
      4096 -> 8191       : 0        |                        |
      8192 -> 16383      : 2        |                        |
This means that a CPU may hold vcrypto->ctrl_lock as long as 8192~16383us.

To improve the performance of control queue, a request on control queue
waits completion instead of busy polling to reduce lock racing, and gets
completed by control queue callback.
    CPU0   CPU1               ...             CPUx  CPUy
     |      |                                  |     |
     \      \                                  /     /
      \--------spin_lock(&vcrypto->ctrl_lock)-------/
                           |
                 virtqueue add & kick
                           |
      ---------spin_unlock(&vcrypto->ctrl_lock)------
     /      /                                  \     \
     |      |                                  |     |
    wait   wait                               wait  wait

Test this patch, the guest side get ~200K/s operations with 300% CPU
utilization.

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Gonglei <arei.gonglei@huawei.com>
Reviewed-by: Gonglei <arei.gonglei@huawei.com>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Message-Id: <20220506131627.180784-4-pizhenwei@bytedance.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

virtio-crypto: use private buffer for control request

Originally, all of the control requests share a single buffer(
ctrl & input & ctrl_status fields in struct virtio_crypto), this
allows queue depth 1 only, the performance of control queue gets
limited by this design.

In this patch, each request allocates request buffer dynamically, and
free buffer after request, so the scope protected by ctrl_lock also
get optimized here.
It's possible to optimize control queue depth in the next step.

A necessary comment is already in code, still describe it again:
/*
* Note: there are padding fields in request, clear them to zero before
* sending to host to avoid to divulge any information.
* Ex, virtio_crypto_ctrl_request::ctrl::u::destroy_session::padding[48]
*/
So use kzalloc to allocate buffer of struct virtio_crypto_ctrl_request.

Potentially dereferencing uninitialized variables:
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Gonglei <arei.gonglei@huawei.com>
Reviewed-by: Gonglei <arei.gonglei@huawei.com>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Message-Id: <20220506131627.180784-3-pizhenwei@bytedance.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

virtio-crypto: change code style

Use temporary variable to make code easy to read and maintain.
/* Pad cipher's parameters */
        vcrypto->ctrl.u.sym_create_session.op_type =
                cpu_to_le32(VIRTIO_CRYPTO_SYM_OP_CIPHER);
        vcrypto->ctrl.u.sym_create_session.u.cipher.para.algo =
                vcrypto->ctrl.header.algo;
        vcrypto->ctrl.u.sym_create_session.u.cipher.para.keylen =
                cpu_to_le32(keylen);
        vcrypto->ctrl.u.sym_create_session.u.cipher.para.op =
                cpu_to_le32(op);
-->
sym_create_session = &ctrl->u.sym_create_session;
sym_create_session->op_type = cpu_to_le32(VIRTIO_CRYPTO_SYM_OP_CIPHER);
sym_create_session->u.cipher.para.algo = ctrl->header.algo;
sym_create_session->u.cipher.para.keylen = cpu_to_le32(keylen);
sym_create_session->u.cipher.para.op = cpu_to_le32(op);

The new style shows more obviously:
- the variable we want to operate.
- an assignment statement in a single line.

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Gonglei <arei.gonglei@huawei.com>
Reviewed-by: Gonglei <arei.gonglei@huawei.com>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Message-Id: <20220506131627.180784-2-pizhenwei@bytedance.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

virtio-pci: Remove wrong address verification in vp_del_vqs()

GCC 12 enhanced -Waddress when comparing array address to null [0],
which warns:

    drivers/virtio/virtio_pci_common.c: In function ‘vp_del_vqs’:
    drivers/virtio/virtio_pci_common.c:257:29: warning: the comparison will always evaluate as ‘true’ for the pointer operand in ‘vp_dev->msix_affinity_masks + (sizetype)((long unsigned int)i * 256)’ must not be NULL [-Waddress]
      257 |                         if (vp_dev->msix_affinity_masks[i])
          |                             ^~~~~~

In fact, the verification is comparing the result of a pointer
arithmetic, the address "msix_affinity_masks + i", which will always
evaluate to true.

Under the hood, free_cpumask_var() calls kfree(), which is safe to pass
NULL, not requiring non-null verification.  So remove the verification
to make compiler happy (happy compiler, happy life).

[0] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102103

Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Message-Id: <20220415023002.49805-1-muriloo@linux.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Christophe de Dinechin <dinechin@redhat.com>

virtio: pci: Fix an error handling path in vp_modern_probe()

If an error occurs after a successful pci_request_selected_regions() call,
it should be undone by a corresponding pci_release_selected_regions() call,
as already done in vp_modern_remove().

Fixes: fd502729fbbf ("virtio-pci: introduce modern device module")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Message-Id: <237109725aad2c3c03d14549f777b1927c84b045.1648977064.git.christophe.jaillet@wanadoo.fr>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vdpasim: control virtqueue support

This patch introduces the control virtqueue support for vDPA
simulator. This is a requirement for supporting advanced features like
multiqueue.

A requirement for control virtqueue is to isolate its memory access
from the rx/tx virtqueues. This is because when using vDPA device
for VM, the control virqueue is not directly assigned to VM. Userspace
(Qemu) will present a shadow control virtqueue to control for
recording the device states.

The isolation is done via the virtqueue groups and ASID support in
vDPA through vhost-vdpa. The simulator is extended to have:

1) three virtqueues: RXVQ, TXVQ and CVQ (control virtqueue)
2) two virtqueue groups: group 0 contains RXVQ and TXVQ; group 1
   contains CVQ
3) two address spaces and the simulator simply implements the address
   spaces by mapping it 1:1 to IOTLB.

For the VM use cases, userspace(Qemu) may set AS 0 to group 0 and AS 1
to group 1. So we have:

1) The IOTLB for virtqueue group 0 contains the mappings of guest, so
   RX and TX can be assigned to guest directly.
2) The IOTLB for virtqueue group 1 contains the mappings of CVQ which
   is the buffers that allocated and managed by VMM only. So CVQ of
   vhost-vdpa is visible to VMM only. And Guest can not access the CVQ
   of vhost-vdpa.

For the other use cases, since AS 0 is associated to all virtqueue
groups by default. All virtqueues share the same mapping by default.

To demonstrate the function, VIRITO_NET_F_CTRL_MACADDR is
implemented in the simulator for the driver to set mac address.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Gautam Dawar <gdawar@xilinx.com>
Message-Id: <20220330180436.24644-20-gdawar@xilinx.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vdpa_sim: filter destination mac address

This patch implements a simple unicast filter for vDPA simulator.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Gautam Dawar <gdawar@xilinx.com>
Message-Id: <20220330180436.24644-19-gdawar@xilinx.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vdpa_sim: factor out buffer completion logic

Wrap up common buffer completion logic in to vdpasim_net_complete

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Gautam Dawar <gdawar@xilinx.com>
Message-Id: <20220330180436.24644-18-gdawar@xilinx.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vdpa_sim: advertise VIRTIO_NET_F_MTU

We've already reported maximum mtu via config space, so let's
advertise the feature.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Gautam Dawar <gdawar@xilinx.com>
Message-Id: <20220330180436.24644-17-gdawar@xilinx.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vhost-vdpa: support ASID based IOTLB API

This patch extends the vhost-vdpa to support ASID based IOTLB API. The
vhost-vdpa device will allocated multiple IOTLBs for vDPA device that
supports multiple address spaces. The IOTLBs and vDPA device memory
mappings is determined and maintained through ASID.

Note that we still don't support vDPA device with more than one
address spaces that depends on platform IOMMU. This work will be done
by moving the IOMMU logic from vhost-vDPA to vDPA device driver.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Gautam Dawar <gdawar@xilinx.com>
Message-Id: <20220330180436.24644-16-gdawar@xilinx.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Includes fixup:

vhost-vdpa: Fix some error handling path in vhost_vdpa_process_iotlb_msg()

In the error paths introduced by the original patch, a mutex may be left locked.
Add the correct goto instead of a direct return.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Message-Id: <89ef0ae4c26ac3cfa440c71e97e392dcb328ac1b.1653227924.git.christophe.jaillet@wanadoo.fr>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vhost-vdpa: introduce uAPI to set group ASID

Follows the vDPA support for associating ASID to a specific virtqueue
group. This patch adds a uAPI to support setting them from userspace.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Gautam Dawar <gdawar@xilinx.com>
Message-Id: <20220330180436.24644-15-gdawar@xilinx.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vhost-vdpa: uAPI to get virtqueue group id

Follows the support for virtqueue group in vDPA. This patches
introduces uAPI to get the virtqueue group ID for a specific virtqueue
in vhost-vdpa.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Gautam Dawar <gdawar@xilinx.com>
Message-Id: <20220330180436.24644-14-gdawar@xilinx.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vhost-vdpa: introduce uAPI to get the number of address spaces

This patch introduces the uAPI for getting the number of address
spaces supported by this vDPA device.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Gautam Dawar <gdawar@xilinx.com>
Message-Id: <20220330180436.24644-13-gdawar@xilinx.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vhost-vdpa: introduce uAPI to get the number of virtqueue groups

Follows the vDPA support for multiple address spaces, this patch
introduce uAPI for the userspace to know the number of virtqueue
groups supported by the vDPA device.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Gautam Dawar <gdawar@xilinx.com>
Message-Id: <20220330180436.24644-12-gdawar@xilinx.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vhost-vdpa: introduce asid based IOTLB

This patch converts the vhost-vDPA device to support multiple IOTLBs
tagged via ASID via hlist. This will be used for supporting multiple
address spaces in the following patches.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Gautam Dawar <gdawar@xilinx.com>
Message-Id: <20220330180436.24644-11-gdawar@xilinx.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vhost: support ASID in IOTLB API

This patches allows userspace to send ASID based IOTLB message to
vhost. This idea is to use the reserved u32 field in the existing V2
IOTLB message. Vhost device should advertise this capability via
VHOST_BACKEND_F_IOTLB_ASID backend feature.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Gautam Dawar <gdawar@xilinx.com>
Message-Id: <20220330180436.24644-10-gdawar@xilinx.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vhost_iotlb: split out IOTLB initialization

This patch splits out IOTLB initialization to make sure it could be
reused by external modules.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Gautam Dawar <gdawar@xilinx.com>
Message-Id: <20220330180436.24644-9-gdawar@xilinx.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vdpa: introduce config operations for associating ASID to a virtqueue group

This patch introduces a new bus operation to allow the vDPA bus driver
to associate an ASID to a virtqueue group.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Gautam Dawar <gdawar@xilinx.com>
Message-Id: <20220330180436.24644-8-gdawar@xilinx.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vdpa: multiple address spaces support

This patches introduces the multiple address spaces support for vDPA
device. This idea is to identify a specific address space via an
dedicated identifier - ASID.

During vDPA device allocation, vDPA device driver needs to report the
number of address spaces supported by the device then the DMA mapping
ops of the vDPA device needs to be extended to support ASID.

This helps to isolate the environments for the virtqueue that will not
be assigned directly. E.g in the case of virtio-net, the control
virtqueue will not be assigned directly to guest.

As a start, simply claim 1 virtqueue groups and 1 address spaces for
all vDPA devices. And vhost-vDPA will simply reject the device with
more than 1 virtqueue groups or address spaces.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Gautam Dawar <gdawar@xilinx.com>
Message-Id: <20220330180436.24644-7-gdawar@xilinx.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vdpa: introduce virtqueue groups

This patch introduces virtqueue groups to vDPA device. The virtqueue
group is the minimal set of virtqueues that must share an address
space. And the address space identifier could only be attached to
a specific virtqueue group.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Gautam Dawar <gdawar@xilinx.com>
Message-Id: <20220330180436.24644-6-gdawar@xilinx.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vhost-vdpa: switch to use vhost-vdpa specific IOTLB

To ease the implementation of per group ASID support for vDPA
device. This patch switches to use a vhost-vdpa specific IOTLB to
avoid the unnecessary refactoring of the vhost core.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Gautam Dawar <gdawar@xilinx.com>
Message-Id: <20220330180436.24644-5-gdawar@xilinx.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vhost-vdpa: passing iotlb to IOMMU mapping helpers

To prepare for the ASID support for vhost-vdpa, try to pass IOTLB
object to dma helpers. No functional changes, it's just a preparation
for support multiple IOTLBs.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Gautam Dawar <gdawar@xilinx.com>
Message-Id: <20220330180436.24644-4-gdawar@xilinx.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

virtio-vdpa: don't set callback if virtio doesn't need it

There's no need for setting callbacks for the driver that doesn't care
about that.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Gautam Dawar <gdawar@xilinx.com>
Message-Id: <20220330180436.24644-3-gdawar@xilinx.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vhost: move the backend feature bits to vhost_types.h

We should store feature bits in vhost_types.h as what has been done
for e.g VHOST_F_LOG_ALL.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Gautam Dawar <gdawar@xilinx.com>
Message-Id: <20220330180436.24644-2-gdawar@xilinx.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

virtio_ring: add unlikely annotation for free descs check

The 'if (vq->vq.num_free < descs_used)' check will almost always be false.

Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
Message-Id: <20220328105817.1028065-2-xianting.tian@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>

virtio_ring: remove unnecessary to_vvq call in vring hot path

It passes '_vq' to virtqueue_use_indirect(), which still calls
to_vvq to get 'vq', let's directly pass 'vq'. It can avoid
unnecessary call of to_vvq in hot path.

Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
Message-Id: <20220328105817.1028065-1-xianting.tian@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>

virtio-blk: support mq_ops->queue_rqs()

This patch supports mq_ops->queue_rqs() hook. It has an advantage of
batch submission to virtio-blk driver. It also helps polling I/O because
polling uses batched completion of block layer. Batch submission in
queue_rqs() can boost polling performance.

In queue_rqs(), it iterates plug->mq_list, collects requests that
belong to same HW queue until it encounters a request from other
HW queue or sees the end of the list.
Then, virtio-blk adds requests into virtqueue and kicks virtqueue
to submit requests.

If there is an error, it inserts error request to requeue_list and
passes it to ordinary block layer path.

For verification, I did fio test.
(io_uring, randread, direct=1, bs=4K, iodepth=64 numjobs=N)
I set 4 vcpu and 2 virtio-blk queues for VM and run fio test 5 times.
It shows about 2% improvement.

                                 |   numjobs=2   |   numjobs=4
      -----------------------------------------------------------
        fio without queue_rqs()  |   291K IOPS   |   238K IOPS
      -----------------------------------------------------------
        fio with queue_rqs()     |   295K IOPS   |   243K IOPS

For polling I/O performance, I also did fio test as below.
(io_uring, hipri, randread, direct=1, bs=512, iodepth=64 numjobs=4)
I set 4 vcpu and 2 poll queues for VM.
It shows about 2% improvement in polling I/O.

                                      |   IOPS   |  avg latency
      -----------------------------------------------------------
        fio poll without queue_rqs()  |   424K   |   613.05 usec
      -----------------------------------------------------------
        fio poll with queue_rqs()     |   435K   |   601.01 usec

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Suwan Kim <suwan.kim027@gmail.com>
Message-Id: <20220406153207.163134-3-suwan.kim027@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

virtio-blk: support polling I/O

This patch supports polling I/O via virtio-blk driver. Polling
feature is enabled by module parameter "poll_queues" and it sets
dedicated polling queues for virtio-blk. This patch improves the
polling I/O throughput and latency.

The virtio-blk driver doesn't not have a poll function and a poll
queue and it has been operating in interrupt driven method even if
the polling function is called in the upper layer.

virtio-blk polling is implemented upon 'batched completion' of block
layer. virtblk_poll() queues completed request to io_comp_batch->req_list
and later, virtblk_complete_batch() calls unmap function and ends
the requests in batch.

virtio-blk reads the number of poll queues from module parameter
"poll_queues". If VM sets queue parameter as below,
("num-queues=N" [QEMU property], "poll_queues=M" [module parameter])
It allocates N virtqueues to virtio_blk->vqs[N] and it uses [0..(N-M-1)]
as default queues and [(N-M)..(N-1)] as poll queues. Unlike the default
queues, the poll queues have no callback function.

Regarding HW-SW queue mapping, the default queue mapping uses the
existing method that condsiders MSI irq vector. But the poll queue
doesn't have an irq, so it uses the regular blk-mq cpu mapping.

For verifying the improvement, I did Fio polling I/O performance test
with io_uring engine with the options below.
(io_uring, hipri, randread, direct=1, bs=512, iodepth=64 numjobs=N)
I set 4 vcpu and 4 virtio-blk queues - 2 default queues and 2 poll
queues for VM.

As a result, IOPS and average latency improved about 10%.

Test result:

- Fio io_uring poll without virtio-blk poll support
-- numjobs=1 : IOPS = 339K, avg latency = 188.33us
-- numjobs=2 : IOPS = 367K, avg latency = 347.33us
-- numjobs=4 : IOPS = 383K, avg latency = 682.06us

- Fio io_uring poll with virtio-blk poll support
-- numjobs=1 : IOPS = 385K, avg latency = 165.94us
-- numjobs=2 : IOPS = 408K, avg latency = 313.28us
-- numjobs=4 : IOPS = 424K, avg latency = 613.05us

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Suwan Kim <suwan.kim027@gmail.com>
Message-Id: <20220406153207.163134-2-suwan.kim027@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

vdpa/mlx5: Use readers/writers semaphore instead of mutex

Reading statistics could be done intensively and by several processes
concurrently. Reader's lock is sufficient in this case.

Change reslock from mutex to a rwsem.

Suggested-by: Si-Wei Liu <si-wei.liu@oracle.com>
Signed-off-by: Eli Cohen <elic@nvidia.com>
Message-Id: <20220518133804.1075129-7-elic@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vdpa/mlx5: Add support for reading descriptor statistics

Implement the get_vq_stats calback of vdpa_config_ops to return the
statistics for a virtqueue.

The statistics are provided as vendor specific statistics where the
driver provides a pair of attribute name and attribute value.

Currently supported are received descriptors and completed descriptors.

Signed-off-by: Eli Cohen <elic@nvidia.com>
Message-Id: <20220518133804.1075129-6-elic@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

net/vdpa: Use readers/writers semaphore instead of cf_mutex

Replace cf_mutex with rw_semaphore to reflect the fact that some calls
could be called concurrently but can suffice with read lock.

Suggested-by: Si-Wei Liu <si-wei.liu@oracle.com>
Signed-off-by: Eli Cohen <elic@nvidia.com>
Message-Id: <20220518133804.1075129-5-elic@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

net/vdpa: Use readers/writers semaphore instead of vdpa_dev_mutex

Use rw_semaphore instead of mutex to control access to vdpa devices.
This can be especially beneficial in case processes poll on statistics
information.

Suggested-by: Si-Wei Liu <si-wei.liu@oracle.com>
Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Eli Cohen <elic@nvidia.com>
Message-Id: <20220518133804.1075129-4-elic@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vdpa: Add support for querying vendor statistics

Allows to read vendor statistics of a vdpa device. The specific
statistics data are received from the upstream driver in the form of an
(attribute name, attribute value) pairs.

An example of statistics for mlx5_vdpa device are:

received_desc - number of descriptors received by the virtqueue
completed_desc - number of descriptors completed by the virtqueue

A descriptor using indirect buffers is still counted as 1. In addition,
N chained descriptors are counted correctly N times as one would expect.

A new callback was added to vdpa_config_ops which provides the means for
the vdpa driver to return statistics results.

The interface allows for reading all the supported virtqueues, including
the control virtqueue if it exists.

Below are some examples taken from mlx5_vdpa which are introduced in the
following patch:

1. Read statistics for the virtqueue at index 1

$ vdpa dev vstats show vdpa-a qidx 1
vdpa-a:
queue_type tx queue_index 1 received_desc 3844836 completed_desc 3844836

2. Read statistics for the virtqueue at index 32
$ vdpa dev vstats show vdpa-a qidx 32
vdpa-a:
queue_type control_vq queue_index 32 received_desc 62 completed_desc 62

3. Read statisitics for the virtqueue at index 0 with json output
$ vdpa -j dev vstats show vdpa-a qidx 0
{"vstats":{"vdpa-a":{
"queue_type":"rx","queue_index":0,"name":"received_desc","value":417776,\
"name":"completed_desc","value":417548}}}

4. Read statistics for the virtqueue at index 0 with preety json output
$ vdpa -jp dev vstats show vdpa-a qidx 0
{
    "vstats": {
        "vdpa-a": {

            "queue_type": "rx",
            "queue_index": 0,
            "name": "received_desc",
            "value": 417776,
            "name": "completed_desc",
            "value": 417548
        }
    }
}

Signed-off-by: Eli Cohen <elic@nvidia.com>
Message-Id: <20220518133804.1075129-3-elic@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vdpa: Fix error logic in vdpa_nl_cmd_dev_get_doit

In vdpa_nl_cmd_dev_get_doit(), if the call to genlmsg_reply() fails we
must not call nlmsg_free() since this is done inside genlmsg_reply().

Fix it.

Fixes: bc0d90ee021f ("vdpa: Enable user to query vdpa device info")
Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Eli Cohen <elic@nvidia.com>
Message-Id: <20220518133804.1075129-2-elic@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

Merge tag 'for-5.19/fbdev-1' of git://git./linux/kernel/git/deller/linux-fbdev

Pull fbdev fixes and updates from Helge Deller:
"A buch of small fixes and cleanups, including:

   - vesafb: Fix a use-after-free due early fb_info cleanup

   - clcdfb: Fix refcount leak in clcdfb_of_vram_setup

   - hyperv_fb: Allow resolutions with size > 64 MB for Gen1

   - pxa3xx-gcu: release the resources correctly in
     pxa3xx_gcu_probe/remove()

   - omapfb: Prevent compiler warning regarding
     hwa742_update_window_async()"

* tag 'for-5.19/fbdev-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
  video: fbdev: omap: Add prototype for hwa742_update_window_async()
  video: fbdev: vesafb: Fix a use-after-free due early fb_info cleanup
  video: fbdev: radeon: Fix spelling typo in comment
  video: fbdev: xen: remove setting of 'transp' parameter
  video: fbdev: pxa3xx-gcu: release the resources correctly in pxa3xx_gcu_probe/remove()
  video: fbdev: omapfb: simplify the return expression of nec_8048_connect()
  video: fbdev: omapfb: simplify the return expression of dsi_init_pll_data()
  video: fbdev: clcdfb: Fix refcount leak in clcdfb_of_vram_setup
  video: fbdev: hyperv_fb: Allow resolutions with size > 64 MB for Gen1

Merge tag 'for-5.19/parisc-1' of git://git./linux/kernel/git/deller/parisc-linux

Pull parisc architecture updates from Helge Deller:
"Minor cleanups and code optimizations, e.g.:

   - improvements in assembly statements in the tmpalias code path

   - added some additionals compile time checks

   - drop some unneccesary assembler DMA syncs"

* tag 'for-5.19/parisc-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Drop __ARCH_WANT_OLD_READDIR and __ARCH_WANT_SYS_OLDUMOUNT
  parisc: Optimize tmpalias function calls
  parisc: Add dep_safe() macro to deposit a register in 32- and 64-kernels
  parisc: Fix wrong comment for shr macro
  parisc: Prevent ldil() to sign-extend into upper 32 bits
  parisc: Don't hardcode assembler bit definitions in tmpalias code
  parisc: Don't enforce DMA completion order in cache flushes
  parisc: video: fbdev: stifb: Add sti_dump_font() to dump STI font

Merge tag 'pm-5.19-rc1-2' of git://git./linux/kernel/git/rafael/linux-pm

Pull more power management updates from Rafael Wysocki:
"These update the ARM cpufreq drivers and fix up the CPPC cpufreq
  driver after recent changes, update the OPP code and PM documentation
  and add power sequences support to the system reboot and power off
  code.

  Specifics:

   - Add Tegra234 cpufreq support (Sumit Gupta)

   - Clean up and enhance the Mediatek cpufreq driver (Wan Jiabing,
     Rex-BC Chen, and Jia-Wei Chang)

   - Fix up the CPPC cpufreq driver after recent changes (Zheng Bin,
     Pierre Gondois)

   - Minor update to dt-binding for Qcom's opp-v2-kryo-cpu (Yassine
     Oudjana)

   - Use list iterator only inside the list_for_each_entry loop
     (Xiaomeng Tong, and Jakob Koschel)

   - New APIs related to finding OPP based on interconnect bandwidth
     (Krzysztof Kozlowski)

   - Fix the missing of_node_put() in _bandwidth_supported() (Dan
     Carpenter)

   - Cleanups (Krzysztof Kozlowski, and Viresh Kumar)

   - Add Out of Band mode description to the intel-speed-select utility
     documentation (Srinivas Pandruvada)

   - Add power sequences support to the system reboot and power off code
     and make related platform-specific changes for multiple platforms
     (Dmitry Osipenko, Geert Uytterhoeven)"

* tag 'pm-5.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (60 commits)
  cpufreq: CPPC: Fix unused-function warning
  cpufreq: CPPC: Fix build error without CONFIG_ACPI_CPPC_CPUFREQ_FIE
  Documentation: admin-guide: PM: Add Out of Band mode
  kernel/reboot: Change registration order of legacy power-off handler
  m68k: virt: Switch to new sys-off handler API
  kernel/reboot: Add devm_register_restart_handler()
  kernel/reboot: Add devm_register_power_off_handler()
  soc/tegra: pmc: Use sys-off handler API to power off Nexus 7 properly
  reboot: Remove pm_power_off_prepare()
  regulator: pfuze100: Use devm_register_sys_off_handler()
  ACPI: power: Switch to sys-off handler API
  memory: emif: Use kernel_can_power_off()
  mips: Use do_kernel_power_off()
  ia64: Use do_kernel_power_off()
  x86: Use do_kernel_power_off()
  sh: Use do_kernel_power_off()
  m68k: Switch to new sys-off handler API
  powerpc: Use do_kernel_power_off()
  xen/x86: Use do_kernel_power_off()
  parisc: Use do_kernel_power_off()
  ...

Merge tag 'thermal-5.19-rc1-2' of git://git./linux/kernel/git/rafael/linux-pm

Pull additional thermal control update from Rafael Wysocki:
"Add Meteor Lake PCI device ID to the int340x thermal control driver
(Sumeet Pawnikar)"

* tag 'thermal-5.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thermal: int340x: Add Meteor Lake PCI device ID

Merge tag 'acpi-5.19-rc1-2' of git://git./linux/kernel/git/rafael/linux-pm

Pull more ACPI updates from Rafael Wysocki:
"These add some new device IDs, update a few drivers (processor,
  battery, backlight) and clean up code in a few places.

  Specifics:

   - Add Meteor Lake ACPI IDs for DPTF devices (Sumeet Pawnikar)

   - Rearrange find_child_checks() to simplify code (Rafael Wysocki)

   - Use memremap() to map the UCSI mailbox that is always in main
     memory and drop acpi_release_memory() that has no more users
     (Heikki Krogerus, Dan Carpenter)

   - Make max_cstate/nocst/bm_check_disable processor module parameters
     visible in sysfs (Yajun Deng)

   - Fix typo in the CPPC driver (Julia Lawall)

   - Make the ACPI battery driver show the "not-charging" status by
     default unless "charging" or "full" is directly indicated (Werner
     Sembach)

   - Improve the PM notifier in the ACPI backlight driver (Zhang Rui)

   - Clean up some white space in the ACPI code (Ian Cowan)"

* tag 'acpi-5.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  usb: typec: ucsi: acpi: fix a NULL vs IS_ERR() check in probe
  ACPI: DPTF: Support Meteor Lake
  ACPI: CPPC: fix typo in comment
  ACPI: video: improve PM notifer callback
  ACPI: clean up white space in a few places for consistency
  ACPI: glue: Rearrange find_child_checks()
  ACPI: processor: idle: Expose max_cstate/nocst/bm_check_disable read-only in sysfs
  ACPI: battery: Make "not-charging" the default on no charging or full info
  ACPI: OSL: Remove the helper for deactivating memory region
  usb: typec: ucsi: acpi: Map the mailbox with memremap()

Merge tag 'ovl-update-5.19' of git://git./linux/kernel/git/mszeredi/vfs

Pull overlayfs updates from Miklos Szeredi:

- Support idmapped layers in overlayfs (Christian Brauner)

- Add a fix to exportfs that is relevant to open_by_handle_at(2) as
   well

- Introduce new lookup helpers that allow passing mnt_userns into
   inode_permission()

* tag 'ovl-update-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: support idmapped layers
  ovl: handle idmappings in ovl_xattr_{g,s}et()
  ovl: handle idmappings in layer open helpers
  ovl: handle idmappings in ovl_permission()
  ovl: use ovl_copy_{real,upper}attr() wrappers
  ovl: store lower path in ovl_inode
  ovl: handle idmappings for layer lookup
  ovl: handle idmappings for layer fileattrs
  ovl: use ovl_path_getxattr() wrapper
  ovl: use ovl_lookup_upper() wrapper
  ovl: use ovl_do_notify_change() wrapper
  ovl: pass layer mnt to ovl_open_realfile()
  ovl: pass ofs to setattr operations
  ovl: handle idmappings in creation operations
  ovl: add ovl_upper_mnt_userns() wrapper
  ovl: pass ofs to creation operations
  ovl: use wrappers to all vfs_*xattr() calls
  exportfs: support idmapped mounts
  fs: add two trivial lookup helpers

Merge tag 'mips_5.19' of git://git./linux/kernel/git/mips/linux

Pull MIPS updates from Thomas Bogendoerfer:
"Cleanups and fixes"

* tag 'mips_5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (38 commits)
  MIPS: RALINK: Define pci_remap_iospace under CONFIG_PCI_DRIVERS_GENERIC
  MIPS: Use memblock_add_node() in early_parse_mem() under CONFIG_NUMA
  MIPS: Return -EINVAL if mem parameter is empty in early_parse_mem()
  MIPS: Kconfig: Fix indentation and add endif comment
  MIPS: bmips: Fix compiler warning observed on W=1 build
  MIPS: Rewrite `csum_tcpudp_nofold' in plain C
  mips: setup: use strscpy to replace strlcpy
  MIPS: Octeon: add SNIC10E board
  MIPS: Ingenic: Refresh defconfig for CU1000-Neo and CU1830-Neo.
  MIPS: Ingenic: Refresh device tree for Ingenic SoCs and boards.
  MIPS: Ingenic: Add PWM nodes for X1830.
  MIPS: Octeon: fix typo in comment
  MIPS: loongson32: Kconfig: Remove extra space
  MIPS: Sibyte: remove unnecessary return variable
  MIPS: Use NOKPROBE_SYMBOL() instead of __kprobes annotation
  selftests/ftrace: Save kprobe_events to test log
  MIPS: tools: no need to initialise statics to 0
  MIPS: Loongson: Use hwmon_device_register_with_groups() to register hwmon
  MIPS: VR41xx: Drop redundant spinlock initialization
  MIPS: smp: optimization for flush_tlb_mm when exiting
  ...

Merge tag 'm68knommu-for-v5.19' of git://git./linux/kernel/git/gerg/m68knommu

Pull m68knommu updates from Greg Ungerer:
"A collection of changes to add elf-fdpic loader support for m68k.

  Also a collection of various fixes. They include typo corrections,
  undefined symbol compilation fixes, removal of the ISA_DMA_API support
  and removal of unused code.

  Summary:

   - correctly set up ZERO_PAGE pointer

   - drop ISA_DMA_API support

   - fix comment typos

   - fixes for undefined symbols

   - remove unused code and variables

   - elf-fdpic loader support for m68k"

* tag 'm68knommu-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
  m68knommu: fix 68000 CPU link with no platform selected
  m68k: removed unused "mach_get_ss"
  m68knommu: fix undefined reference to `mach_get_rtc_pll'
  m68knommu: fix undefined reference to `_init_sp'
  m68knommu: allow elf_fdpic loader to be selected
  m68knommu: add definitions to support elf_fdpic program loader
  m68knommu: implement minimal regset support
  m68knommu: use asm-generic/mmu.h for nommu setups
  m68k: fix typos in comments
  m68k: coldfire: drop ISA_DMA_API support
  m68knommu: set ZERO_PAGE() to the allocated zeroed page

Merge branches 'acpi-battery', 'acpi-video' and 'acpi-misc'

Merge ACPI battery and backlight driver update and miscellaneous
cleanup for 5.19-rc1:

- Make the ACPI battery driver show the "not-charging" status by
   default unless "charging" or "full" is directly indicated (Werner
   Sembach).

- Improve the PM notifier in the ACPI backlight driver (Zhang Rui).

- Clean up some white space in the ACPI code (Ian Cowan).

* acpi-battery:
  ACPI: battery: Make "not-charging" the default on no charging or full info

* acpi-video:
  ACPI: video: improve PM notifer callback

* acpi-misc:
  ACPI: clean up white space in a few places for consistency

Merge branches 'acpi-glue', 'acpi-osl', 'acpi-processor' and 'acpi-cppc'

Merge general ACPI cleanups and processor support updates for 5.19-rc1:

- Rearrange find_child_checks() to simplify code (Rafael Wysocki).

- Use memremap() to map the UCSI mailbox that is always in main memory
   and drop acpi_release_memory() that has no more users (Heikki
   Krogerus, Dan Carpenter).

- Make max_cstate/nocst/bm_check_disable processor module parameters
   visible in sysfs (Yajun Deng).

- Fix typo in the CPPC driver (Julia Lawall).

* acpi-glue:
  ACPI: glue: Rearrange find_child_checks()

* acpi-osl:
  usb: typec: ucsi: acpi: fix a NULL vs IS_ERR() check in probe
  ACPI: OSL: Remove the helper for deactivating memory region
  usb: typec: ucsi: acpi: Map the mailbox with memremap()

* acpi-processor:
  ACPI: processor: idle: Expose max_cstate/nocst/bm_check_disable read-only in sysfs

* acpi-cppc:
  ACPI: CPPC: fix typo in comment

usb: typec: ucsi: acpi: fix a NULL vs IS_ERR() check in probe

The devm_memremap() function never returns NULL. It returns error
pointers.

Fixes: cdc3d2abf438 ("usb: typec: ucsi: acpi: Map the mailbox with memremap()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

parisc: Drop __ARCH_WANT_OLD_READDIR and __ARCH_WANT_SYS_OLDUMOUNT

Those old syscalls aren't exported via our syscall table, so just drop
them.

Signed-off-by: Helge Deller <deller@gmx.de>

Merge branch 'pm-sysoff'

Merge system power off handling rework from Dmitry Osipenko for
5.19-rc1.

This introduces a mechanism allowing power sequences to be used for
powering off the system and makes related changes in platform-specific
code for multiple platforms.

* pm-sysoff: (29 commits)
  kernel/reboot: Change registration order of legacy power-off handler
  m68k: virt: Switch to new sys-off handler API
  kernel/reboot: Add devm_register_restart_handler()
  kernel/reboot: Add devm_register_power_off_handler()
  soc/tegra: pmc: Use sys-off handler API to power off Nexus 7 properly
  reboot: Remove pm_power_off_prepare()
  regulator: pfuze100: Use devm_register_sys_off_handler()
  ACPI: power: Switch to sys-off handler API
  memory: emif: Use kernel_can_power_off()
  mips: Use do_kernel_power_off()
  ia64: Use do_kernel_power_off()
  x86: Use do_kernel_power_off()
  sh: Use do_kernel_power_off()
  m68k: Switch to new sys-off handler API
  powerpc: Use do_kernel_power_off()
  xen/x86: Use do_kernel_power_off()
  parisc: Use do_kernel_power_off()
  arm64: Use do_kernel_power_off()
  riscv: Use do_kernel_power_off()
  csky: Use do_kernel_power_off()
  ...

Merge branch 'pm-docs'

Merge PM documentation update for 5.19-rc1.

* pm-docs:
Documentation: admin-guide: PM: Add Out of Band mode

Merge branch 'pm-opp'

Merge OPP (Operating Performance Points) changes for 5.19-rc1:

- Minor update to dt-binding for Qcom's opp-v2-kryo-cpu (Yassine
   Oudjana).

- Use list iterator only inside the list_for_each_entry loop (Xiaomeng
   Tong, and Jakob Koschel).

- New APIs related to finding OPP based on interconnect bandwidth
   (Krzysztof Kozlowski).

- Fix the missing of_node_put() in _bandwidth_supported() (Dan
   Carpenter).

- Cleanups (Krzysztof Kozlowski, and Viresh Kumar).

* pm-opp:
  opp: Reorder definition of ceil/floor helpers
  opp: Add apis to retrieve opps with interconnect bandwidth
  dt-bindings: opp: opp-v2-kryo-cpu: Remove SMEM
  opp: use list iterator only inside the loop
  opp: replace usage of found with dedicated list iterator variable
  PM: opp: simplify with dev_err_probe()
  OPP: call of_node_put() on error path in _bandwidth_supported()

cpufreq: CPPC: Fix unused-function warning

Building the cppc_cpufreq driver with for arm64 with
CONFIG_ENERGY_MODEL=n triggers the following warnings:
drivers/cpufreq/cppc_cpufreq.c:550:12: error: ‘cppc_get_cpu_cost’ defined but not used
[-Werror=unused-function]
   550 | static int cppc_get_cpu_cost(struct device *cpu_dev, unsigned long KHz,
       |            ^~~~~~~~~~~~~~~~~
drivers/cpufreq/cppc_cpufreq.c:481:12: error: ‘cppc_get_cpu_power’ defined but not used
[-Werror=unused-function]
   481 | static int cppc_get_cpu_power(struct device *cpu_dev,
       |            ^~~~~~~~~~~~~~~~~~

Move the Energy Model related functions into specific guards.
This allows to fix the warning and prevent doing extra work
when the Energy Model is not present.

Fixes: 740fcdc2c20e ("cpufreq: CPPC: Register EM based on efficiency class information")
Reported-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Tested-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: CPPC: Fix build error without CONFIG_ACPI_CPPC_CPUFREQ_FIE

If CONFIG_ACPI_CPPC_CPUFREQ_FIE is not set, building fails:

drivers/cpufreq/cppc_cpufreq.c: In function ‘populate_efficiency_class’:
drivers/cpufreq/cppc_cpufreq.c:584:2: error: ‘cppc_cpufreq_driver’ undeclared (first use in this function); did you mean ‘cpufreq_driver’?
  cppc_cpufreq_driver.register_em = cppc_cpufreq_register_em;
  ^~~~~~~~~~~~~~~~~~~
  cpufreq_driver

Make declare of cppc_cpufreq_driver out of CONFIG_ACPI_CPPC_CPUFREQ_FIE
to fix this.

Fixes: 740fcdc2c20e ("cpufreq: CPPC: Register EM based on efficiency class information")
Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Merge tag 'dmaengine-5.19-rc1' of git://git./linux/kernel/git/vkoul/dmaengine

Pull dmaengine updates from Vinod Koul:
"Nothing special, this includes a couple of new device support and new
  driver support and bunch of driver updates.

  New support:

   - Tegra gpcdma driver support

   - Qualcomm SM8350, Sm8450 and SC7280 device support

   - Renesas RZN1 dma and platform support

  Updates:

   - stm32 device pause/resume support and updates

   - DMA memset ops Documentation and usage clarification

   - deprecate '#dma-channels' & '#dma-requests' bindings

   - driver updates for stm32, ptdma idsx etc"

* tag 'dmaengine-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: (87 commits)
  dmaengine: idxd: make idxd_wq_enable() return 0 if wq is already enabled
  dmaengine: sun6i: Add support for the D1 variant
  dmaengine: sun6i: Add support for 34-bit physical addresses
  dmaengine: sun6i: Do not use virt_to_phys
  dt-bindings: dma: sun50i-a64: Add compatible for D1
  dmaengine: tegra: Remove unused switch case
  dmaengine: tegra: Fix uninitialized variable usage
  dmaengine: stm32-dma: add device_pause/device_resume support
  dmaengine: stm32-dma: rename pm ops before dma pause/resume introduction
  dmaengine: stm32-dma: pass DMA_SxSCR value to stm32_dma_handle_chan_done()
  dmaengine: stm32-dma: introduce stm32_dma_sg_inc to manage chan->next_sg
  dmaengine: stm32-dmamux: avoid reset of dmamux if used by coprocessor
  dmaengine: qcom: gpi: Add support for sc7280
  dt-bindings: dma: pl330: Add power-domains
  dmaengine: stm32-mdma: use dev_dbg on non-busy channel spurious it
  dmaengine: stm32-mdma: fix chan initialization in stm32_mdma_irq_handler()
  dmaengine: stm32-mdma: remove GISR1 register
  dmaengine: ti: deprecate '#dma-channels'
  dmaengine: mmp: deprecate '#dma-channels'
  dmaengine: pxa: deprecate '#dma-channels' and '#dma-requests'
  ...

Merge tag 'trace-tools-v5.19' of git://git./linux/kernel/git/rostedt/linux-trace

Pull tracing tool updates from Steven Rostedt:

- Various clean ups and fixes to rtla (Real Time Linux Analysis)

* tag 'trace-tools-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  rtla: Remove procps-ng dependency
  rtla: Fix __set_sched_attr error message
  rtla: Minor grammar fix for rtla README
  rtla: Don't overwrite existing directory mode
  rtla: Avoid record NULL pointer dereference
  rtla/Makefile: Properly handle dependencies

Merge tag 'trace-v5.19' of git://git./linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
"The majority of the changes are for fixes and clean ups.

  Notable changes:

   - Rework trace event triggers code to be easier to interact with.

   - Support for embedding bootconfig with the kernel (as suppose to
     having it embedded in initram). This is useful for embedded boards
     without initram disks.

   - Speed up boot by parallelizing the creation of tracefs files.

   - Allow absolute ring buffer timestamps handle timestamps that use
     more than 59 bits.

   - Added new tracing clock "TAI" (International Atomic Time)

   - Have weak functions show up in available_filter_function list as:
     __ftrace_invalid_address___<invalid-offset> instead of using the
     name of the function before it"

* tag 'trace-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (52 commits)
  ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid adding weak function
  tracing: Fix comments for event_trigger_separate_filter()
  x86/traceponit: Fix comment about irq vector tracepoints
  x86,tracing: Remove unused headers
  ftrace: Clean up hash direct_functions on register failures
  tracing: Fix comments of create_filter()
  tracing: Disable kcov on trace_preemptirq.c
  tracing: Initialize integer variable to prevent garbage return value
  ftrace: Fix typo in comment
  ftrace: Remove return value of ftrace_arch_modify_*()
  tracing: Cleanup code by removing init "char *name"
  tracing: Change "char *" string form to "char []"
  tracing/timerlat: Do not wakeup the thread if the trace stops at the IRQ
  tracing/timerlat: Print stacktrace in the IRQ handler if needed
  tracing/timerlat: Notify IRQ new max latency only if stop tracing is set
  kprobes: Fix build errors with CONFIG_KRETPROBES=n
  tracing: Fix return value of trace_pid_write()
  tracing: Fix potential double free in create_var_ref()
  tracing: Use strim() to remove whitespace instead of doing it manually
  ftrace: Deal with error return code of the ftrace_process_locs() function
  ...

Merge tag 'perf-tools-for-v5.19-2022-05-28' of git://git./linux/kernel/git/acme/linux

Pull more perf tools updates from Arnaldo Carvalho de Melo:

- Add BPF based off-CPU profiling

- Improvements for system wide recording, specially for Intel PT

- Improve DWARF unwinding on arm64

- Support Arm CoreSight trace data disassembly in 'perf script' python

- Fix build with new libbpf version, related to supporting older
   versions of distro released libbpf packages

- Fix event syntax error caused by ExtSel in the JSON events infra

- Use stdio interface if slang is not supported in 'perf c2c'

- Add 'perf test' checking for perf stat CSV output

- Sync the msr-index.h copy with the kernel sources

* tag 'perf-tools-for-v5.19-2022-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (38 commits)
  tools arch x86: Sync the msr-index.h copy with the kernel sources
  perf scripts python: Support Arm CoreSight trace data disassembly
  perf scripting python: Expose dso and map information
  perf jevents: Fix event syntax error caused by ExtSel
  perf tools arm64: Add support for VG register
  perf unwind arm64: Decouple Libunwind register names from Perf
  perf unwind: Use dynamic register set for DWARF unwind
  perf tools arm64: Copy perf_regs.h from the kernel
  perf unwind arm64: Use perf's copy of kernel headers
  perf c2c: Use stdio interface if slang is not supported
  perf test: Add a basic offcpu profiling test
  perf record: Add cgroup support for off-cpu profiling
  perf record: Handle argument change in sched_switch
  perf record: Implement basic filtering for off-cpu
  perf record: Enable off-cpu analysis with BPF
  perf report: Do not extend sample type of bpf-output event
  perf test: Add checking for perf stat CSV output.
  perf tools: Allow system-wide events to keep their own threads
  perf tools: Allow system-wide events to keep their own CPUs
  libperf evsel: Add comments for booleans
  ...

video: fbdev: omap: Add prototype for hwa742_update_window_async()

The symbol hwa742_update_window_async() is exported, but there is no
prototype defined for it. That's why gcc complains:

drivers-video-fbdev-omap-hwa742.c:warning:no-previous-prototype-for-hwa742_update_window_async

Add the prototype, but I wonder if we couldn't drop exporting the symbol
instead. Since omapfb_update_window_async() is exported the same way,
are there any users outside of the tree?

Signed-off-by: Helge Deller <deller@gmx.de>

Merge tag 'input-for-v5.19-rc0' of git://git./linux/kernel/git/dtor/input

Pull input updates from Dmitry Torokhov:

- a new driver for the Azoteq IQS7222A/B/C capacitive touch controller

- a new driver for Raspberry Pi Sense HAT joystick

- sun4i-lradc-keys gained support of R329 and D1 variants, plus it can
   be now used as a wakeup source

- pm8941-pwrkey can now properly handle PON GEN3 variants; the driver
   also implements software debouncing and has a workaround for missing
   key press events

- assorted driver fixes and cleanups

* tag 'input-for-v5.19-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (29 commits)
  Input: stmfts - do not leave device disabled in stmfts_input_open
  Input: gpio-keys - cancel delayed work only in case of GPIO
  Input: cypress_ps2 - fix typo in comment
  Input: vmmouse - disable vmmouse before entering suspend mode
  dt-bindings: google,cros-ec-keyb: Fixup bad compatible match
  Input: cros-ec-keyb - allow skipping keyboard registration
  dt-bindings: google,cros-ec-keyb: Introduce switches only compatible
  Input: psmouse-smbus - avoid flush_scheduled_work() usage
  Input: bcm-keypad - remove unneeded NULL check before clk_disable_unprepare
  Input: sparcspkr - fix refcount leak in bbc_beep_probe
  Input: sun4i-lradc-keys - add support for R329 and D1
  Input: sun4i-lradc-keys - add optional clock/reset support
  dt-bindings: input: sun4i-lradc-keys: Add R329 and D1 compatibles
  Input: sun4i-lradc-keys - add wakeup support
  Input: pm8941-pwrkey - simulate missed key press events
  Input: pm8941-pwrkey - add software key press debouncing support
  Input: pm8941-pwrkey - add support for PON GEN3 base addresses
  Input: pm8941-pwrkey - fix error message
  Input: synaptics-rmi4 - remove unnecessary flush_workqueue()
  Input: ep93xx_keypad - use devm_platform_ioremap_resource() helper
  ...

drm: fix EDID struct for old ARM OABI format

When building the kernel for arm with the "-mabi=apcs-gnu" option, gcc
will force alignment of all structures and unions to a word boundary
(see also STRUCTURE_SIZE_BOUNDARY and the "-mstructure-size-boundary=XX"
option if you're a gcc person), even when the members of said structures
do not want or need said alignment.

This completely messes up the structure alignment of 'struct edid' on
those targets, because even though all the embedded structures are
marked with "__attribute__((packed))", the unions that contain them are
not.

This was exposed by commit f1e4c916f97f ("drm/edid: add EDID block count
and size helpers"), but the bug is pre-existing.  That commit just made
the structure layout problem cause a build failure due to the addition
of the

        BUILD_BUG_ON(sizeof(*edid) != EDID_LENGTH);

sanity check in drivers/gpu/drm/drm_edid.c:edid_block_data().

This legacy union alignment should probably not be used in the first
place, but we can fix the layout by adding the packed attribute to the
union entries even when each member is already packed and it shouldn't
matter in a sane build environment.

You can see this issue with a trivial test program:

  union {
struct {
char c[5];
};
struct {
char d;
unsigned e;
} __attribute__((packed));
  } a = { "1234" };

where building this with a normal "gcc -S" will result in the expected
5-byte size of said union:

.type a, @object
.size a, 5

but with an ARM compiler and the old ABI:

    arm-linux-gnu-gcc -mabi=apcs-gnu -mfloat-abi=soft -S t.c

you get

.type a, %object
.size a, 8

instead, because even though each member of the union is packed, the
union itself still gets aligned.

This was reported by Sudip for the spear3xx_defconfig target.

Link: https://lore.kernel.org/lkml/YpCUzStDnSgQLNFN@debian/
Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'hyperv-next-signed-20220528' of git://git./linux/kernel/git/hyperv/linux

Pull hyperv updates from Wei Liu:

- Harden hv_sock driver (Andrea Parri)

- Harden Hyper-V PCI driver (Andrea Parri)

- Fix multi-MSI for Hyper-V PCI driver (Jeffrey Hugo)

- Fix Hyper-V PCI to reduce boot time (Dexuan Cui)

- Remove code for long EOL'ed Hyper-V versions (Michael Kelley, Saurabh
   Sengar)

- Fix balloon driver error handling (Shradha Gupta)

- Fix a typo in vmbus driver (Julia Lawall)

- Ignore vmbus IMC device (Michael Kelley)

- Add a new error message to Hyper-V DRM driver (Saurabh Sengar)

* tag 'hyperv-next-signed-20220528' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (28 commits)
  hv_balloon: Fix balloon_probe() and balloon_remove() error handling
  scsi: storvsc: Removing Pre Win8 related logic
  Drivers: hv: vmbus: fix typo in comment
  PCI: hv: Fix synchronization between channel callback and hv_pci_bus_exit()
  PCI: hv: Add validation for untrusted Hyper-V values
  PCI: hv: Fix interrupt mapping for multi-MSI
  PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()
  drm/hyperv: Remove support for Hyper-V 2008 and 2008R2/Win7
  video: hyperv_fb: Remove support for Hyper-V 2008 and 2008R2/Win7
  scsi: storvsc: Remove support for Hyper-V 2008 and 2008R2/Win7
  Drivers: hv: vmbus: Remove support for Hyper-V 2008 and Hyper-V 2008R2/Win7
  x86/hyperv: Disable hardlockup detector by default in Hyper-V guests
  drm/hyperv: Add error message for fb size greater than allocated
  PCI: hv: Do not set PCI_COMMAND_MEMORY to reduce VM boot time
  PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI
  Drivers: hv: vmbus: Refactor the ring-buffer iterator functions
  Drivers: hv: vmbus: Accept hv_sock offers in isolated guests
  hv_sock: Add validation for untrusted Hyper-V values
  hv_sock: Copy packets sent by Hyper-V out of the ring buffer
  hv_sock: Check hv_pkt_iter_first_raw()'s return value
  ...

Merge tag 'powerpc-5.19-1' of git://git./linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:

- Convert to the generic mmap support (ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT)

- Add support for outline-only KASAN with 64-bit Radix MMU (P9 or later)

- Increase SIGSTKSZ and MINSIGSTKSZ and add support for AT_MINSIGSTKSZ

- Enable the DAWR (Data Address Watchpoint) on POWER9 DD2.3 or later

- Drop support for system call instruction emulation

- Many other small features and fixes

Thanks to Alexey Kardashevskiy, Alistair Popple, Andy Shevchenko, Bagas
Sanjaya, Bjorn Helgaas, Bo Liu, Chen Huang, Christophe Leroy, Colin Ian
King, Daniel Axtens, Dwaipayan Ray, Fabiano Rosas, Finn Thain, Frank
Rowand, Fuqian Huang, Guilherme G. Piccoli, Hangyu Hua, Haowen Bai,
Haren Myneni, Hari Bathini, He Ying, Jason Wang, Jiapeng Chong, Jing
Yangyang, Joel Stanley, Julia Lawall, Kajol Jain, Kevin Hao, Krzysztof
Kozlowski, Laurent Dufour, Lv Ruyi, Madhavan Srinivasan, Magali Lemes,
Miaoqian Lin, Minghao Chi, Nathan Chancellor, Naveen N. Rao, Nicholas
Piggin, Oliver O'Halloran, Oscar Salvador, Pali Rohár, Paul Mackerras,
Peng Wu, Qing Wang, Randy Dunlap, Reza Arbab, Russell Currey, Sohaib
Mohamed, Vaibhav Jain, Vasant Hegde, Wang Qing, Wang Wensheng, Xiang
wangx, Xiaomeng Tong, Xu Wang, Yang Guang, Yang Li, Ye Bin, YueHaibing,
Yu Kuai, Zheng Bin, Zou Wei, and Zucheng Zheng.

* tag 'powerpc-5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (200 commits)
  powerpc/64: Include cache.h directly in paca.h
  powerpc/64s: Only set HAVE_ARCH_UNMAPPED_AREA when CONFIG_PPC_64S_HASH_MMU is set
  powerpc/xics: Include missing header
  powerpc/powernv/pci: Drop VF MPS fixup
  powerpc/fsl_book3e: Don't set rodata RO too early
  powerpc/microwatt: Add mmu bits to device tree
  powerpc/powernv/flash: Check OPAL flash calls exist before using
  powerpc/powermac: constify device_node in of_irq_parse_oldworld()
  powerpc/powermac: add missing g5_phy_disable_cpu1() declaration
  selftests/powerpc/pmu: fix spelling mistake "mis-match" -> "mismatch"
  powerpc: Enable the DAWR on POWER9 DD2.3 and above
  powerpc/64s: Add CPU_FTRS_POWER10 to ALWAYS mask
  powerpc/64s: Add CPU_FTRS_POWER9_DD2_2 to CPU_FTRS_ALWAYS mask
  powerpc: Fix all occurences of "the the"
  selftests/powerpc/pmu/ebb: remove fixed_instruction.S
  powerpc/platforms/83xx: Use of_device_get_match_data()
  powerpc/eeh: Drop redundant spinlock initialization
  powerpc/iommu: Add missing of_node_put in iommu_init_early_dart
  powerpc/pseries/vas: Call misc_deregister if sysfs init fails
  powerpc/papr_scm: Fix leaking nvdimm_events_map elements
  ...

Merge tag 'pinctrl-v5.19-1' of git://git./linux/kernel/git/linusw/linux-pinctrl

Pull pin control updates from Linus Walleij:
"Pretty big this time. Mostly due to (nice) Renesas refactorings.

  Core changes:

   - New helpers from Andy such as for_each_gpiochip_node() affecting
     both GPIO and pin control, improving a bunch of drivers in the
     process.

   - Pulled in Marc Zyngiers work to make IRQ chips immutable, and
     started to apply fixups on top.

  New drivers:

   - New driver for Marvell MVEBU 98DX2530.

   - New driver for Mediatek MT8195.

   - Support Qualcomm PMX65 and PM6125.

   - New driver for Qualcomm SC7280 LPASS pin control.

   - New driver for Rockchip RK3588.

   - New driver for NXP Freescale i.MXRT1170.

   - New driver for Mediatek MT6795 Helio X10.

  Improvements:

   - Several Aspeed G6 cleanups and non-critical fixes.

   - Thorought refactoring of some of the ever improving Renesas
     drivers.

   - Clean up Mediatek MT8192 bindings a bit.

   - PWM output and clock monitoring in the Ocelot LAN966x driver.

   - Thorough refactoring and cleanup of the Ralink drivers such as
     RT2880, RT3883, RT305X, MT7620, MT7621, MT7628 splitting these into
     proper sub-drivers"

* tag 'pinctrl-v5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (161 commits)
  pinctrl: apple: Use a raw spinlock for the regmap
  pinctrl: berlin: bg4ct: Use devm_platform_*ioremap_resource() APIs
  pinctrl: intel: Fix kernel doc format, i.e. add return sections
  dt-bindings: pinctrl: qcom: Drop 'maxItems' on 'wakeup-parent'
  pinctrl: starfive: Make the irqchip immutable
  pinctrl: mediatek: Add pinctrl driver for MT6795 Helio X10
  dt-bindings: pinctrl: Add MediaTek MT6795 pinctrl bindings
  pinctrl: freescale: Add i.MXRT1170 pinctrl driver support
  dt-bindings: pinctrl: add i.MXRT1170 pinctrl Documentation
  dt-bindings: pinctrl: rockchip: increase max amount of device functions
  dt-bindings: pinctrl: qcom,pmic-gpio: add 'gpio-reserved-ranges'
  dt-bindings: pinctrl: qcom,pmic-gpio: add 'input-disable'
  dt-bindings: pinctrl: qcom,pmic-gpio: describe gpio-line-names
  dt-bindings: pinctrl: qcom,pmic-gpio: fix matching pin config
  dt-bindings: pinctrl: qcom,pmic-gpio: document PM8150L and PMM8155AU
  pinctrl: qcom: spmi-gpio: Add pm6125 compatible
  dt-bindings: pinctrl: qcom-pmic-gpio: Add pm6125 compatible
  pinctrl: intel: Drop unused irqchip member in struct intel_pinctrl
  pinctrl: intel: make irq_chip immutable
  pinctrl: cherryview: Use GPIO chip pointer in chv_gpio_irq_mask_unmask()
  ...

video: fbdev: vesafb: Fix a use-after-free due early fb_info cleanup

Commit b3c9a924aab6 ("fbdev: vesafb: Cleanup fb_info in .fb_destroy rather
than .remove") fixed a use-after-free error due the vesafb driver freeing
the fb_info in the .remove handler instead of doing it in .fb_destroy.

This can happen if the .fb_destroy callback is executed after the .remove
callback, since the former tries to access a pointer freed by the latter.

But that change didn't take into account that another possible scenario is
that .fb_destroy is called before the .remove callback. For example, if no
process has the fbdev chardev opened by the time the driver is removed.

If that's the case, fb_info will be freed when unregister_framebuffer() is
called, making the fb_info pointer accessed in vesafb_remove() after that
to no longer be valid.

To prevent that, move the expression containing the info->par to happen
before the unregister_framebuffer() function call.

Fixes: b3c9a924aab6 ("fbdev: vesafb: Cleanup fb_info in .fb_destroy rather than .remove")
Reported-by: Pascal Ernster <dri-devel@hardfalcon.net>
Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
Tested-by: Pascal Ernster <dri-devel@hardfalcon.net>
Signed-off-by: Helge Deller <deller@gmx.de>

Revert "crypto: poly1305 - cleanup stray CRYPTO_LIB_POLY1305_RSIZE"

This reverts commit 8bdc2a190105e862dfe7a4033f2fd385b7e58ae8.

It got merged a bit prematurely and shortly after the kernel test robot
and Sudip pointed out build failures:

  arm: imx_v6_v7_defconfig and multi_v7_defconfig
  mips: decstation_64_defconfig, decstation_defconfig, decstation_r4k_defconfig

  In file included from crypto/chacha20poly1305.c:13:
  include/crypto/poly1305.h:56:46: error: 'CONFIG_CRYPTO_LIB_POLY1305_RSIZE' undeclared here (not in a function); did you mean 'CONFIG_CRYPTO_POLY1305_MODULE'?
     56 |                 struct poly1305_key opaque_r[CONFIG_CRYPTO_LIB_POLY1305_RSIZE];
        |                                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We could attempt to fix this by listing the dependencies piecemeal, but
it's not as obvious as it looks: drivers like caam use this macro in
headers even if there's no .o compiled in that makes use of it.  So
actually fixing this might require a bit more of a comprehensive
approach, rather than whack-a-mole with hunting down which drivers use
which headers which use this macro.

Therefore, this commit just reverts the change, and maybe the problem
can be visited on the next rainy day.

Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Fixes: 8bdc2a190105 ("crypto: poly1305 - cleanup stray CRYPTO_LIB_POLY1305_RSIZE")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid adding weak function

If an unused weak function was traced, it's call to fentry will still
exist, which gets added into the __mcount_loc table. Ftrace will use
kallsyms to retrieve the name for each location in __mcount_loc to display
it in the available_filter_functions and used to enable functions via the
name matching in set_ftrace_filter/notrace. Enabling these functions do
nothing but enable an unused call to ftrace_caller. If a traced weak
function is overridden, the symbol of the function would be used for it,
which will either created duplicate names, or if the previous function was
not traced, it would be incorrectly be listed in available_filter_functions
as a function that can be traced.

This became an issue with BPF[1] as there are tooling that enables the
direct callers via ftrace but then checks to see if the functions were
actually enabled. The case of one function that was marked notrace, but
was followed by an unused weak function that was traced. The unused
function's call to fentry was added to the __mcount_loc section, and
kallsyms retrieved the untraced function's symbol as the weak function was
overridden. Since the untraced function would not get traced, the BPF
check would detect this and fail.

The real fix would be to fix kallsyms to not show addresses of weak
functions as the function before it. But that would require adding code in
the build to add function size to kallsyms so that it can know when the
function ends instead of just using the start of the next known symbol.

In the mean time, this is a work around. Add a FTRACE_MCOUNT_MAX_OFFSET
macro that if defined, ftrace will ignore any function that has its call
to fentry/mcount that has an offset from the symbol that is greater than
FTRACE_MCOUNT_MAX_OFFSET.

If CONFIG_HAVE_FENTRY is defined for x86, define FTRACE_MCOUNT_MAX_OFFSET
to zero (unless IBT is enabled), which will have ftrace ignore all locations
that are not at the start of the function (or one after the ENDBR
instruction).

A worker thread is added at boot up to scan all the ftrace record entries,
and will mark any that fail the FTRACE_MCOUNT_MAX_OFFSET test as disabled.
They will still appear in the available_filter_functions file as:

__ftrace_invalid_address___<invalid-offset>

(showing the offset that caused it to be invalid).

This is required for tools that use libtracefs (like trace-cmd does) that
scan the available_filter_functions and enable set_ftrace_filter and
set_ftrace_notrace using indexes of the function listed in the file (this
is a speedup, as enabling thousands of files via names is an O(n^2)
operation and can take minutes to complete, where the indexing takes less
than a second).

The invalid functions cannot be removed from available_filter_functions as
the names there correspond to the ftrace records in the array that manages
them (and the indexing depends on this).

[1] https://lore.kernel.org/all/20220412094923.0abe90955e5db486b7bca279@kernel.org/

Link: https://lkml.kernel.org/r/20220526141912.794c2786@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

Merge tag 'cxl-for-5.19' of git://git./linux/kernel/git/cxl/cxl

Pull cxl updates from Dan Williams:
"Compute Express Link (CXL) updates for this cycle.

  The highlight is new driver-core infrastructure and CXL subsystem
  changes for allowing lockdep to validate device_lock() usage. Thanks
  to PeterZ for setting me straight on the current capabilities of the
  lockdep API, and Greg acked it as well.

  On the CXL ACPI side this update adds support for CXL _OSC so that
  platform firmware knows that it is safe to still grant Linux native
  control of PCIe hotplug and error handling in the presence of CXL
  devices. A circular dependency problem was discovered between suspend
  and CXL memory for cases where the suspend image might be stored in
  CXL memory where that image also contains the PCI register state to
  restore to re-enable the device. Disable suspend for now until an
  architecture is defined to clarify that conflict.

  Lastly a collection of reworks, fixes, and cleanups to the CXL
  subsystem where support for snooping mailbox commands and properly
  handling the "mem_enable" flow are the highlights.

  Summary:

   - Add driver-core infrastructure for lockdep validation of
     device_lock(), and fixup a deadlock report that was previously
     hidden behind the 'lockdep no validate' policy.

   - Add CXL _OSC support for claiming native control of CXL hotplug and
     error handling.

   - Disable suspend in the presence of CXL memory unless and until a
     protocol is identified for restoring PCI device context from memory
     hosted on CXL PCI devices.

   - Add support for snooping CXL mailbox commands to protect against
     inopportune changes, like set-partition with the 'immediate' flag
     set.

   - Rework how the driver detects legacy CXL 1.1 configurations (CXL
     DVSEC / 'mem_enable') before enabling new CXL 2.0 decode
     configurations (CXL HDM Capability).

   - Miscellaneous cleanups and fixes from -next exposure"

* tag 'cxl-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (47 commits)
  cxl/port: Enable HDM Capability after validating DVSEC Ranges
  cxl/port: Reuse 'struct cxl_hdm' context for hdm init
  cxl/port: Move endpoint HDM Decoder Capability init to port driver
  cxl/pci: Drop @info argument to cxl_hdm_decode_init()
  cxl/mem: Merge cxl_dvsec_ranges() and cxl_hdm_decode_init()
  cxl/mem: Skip range enumeration if mem_enable clear
  cxl/mem: Consolidate CXL DVSEC Range enumeration in the core
  cxl/pci: Move cxl_await_media_ready() to the core
  cxl/mem: Validate port connectivity before dvsec ranges
  cxl/mem: Fix cxl_mem_probe() error exit
  cxl/pci: Drop wait_for_valid() from cxl_await_media_ready()
  cxl/pci: Consolidate wait_for_media() and wait_for_media_ready()
  cxl/mem: Drop mem_enabled check from wait_for_media()
  nvdimm: Fix firmware activation deadlock scenarios
  device-core: Kill the lockdep_mutex
  nvdimm: Drop nd_device_lock()
  ACPI: NFIT: Drop nfit_device_lock()
  nvdimm: Replace lockdep_mutex with local lock classes
  cxl: Drop cxl_device_lock()
  cxl/acpi: Add root device lockdep validation
  ...

Merge tag 'clang-format-for-linus-v5.19-rc1' of https://github.com/ojeda/linux

Pull clang-format updates from Miguel Ojeda:
"clang-format modernization and cleanups.

  A few changes from Brian Norris and Mickaël Salaün to start taking
  advantage of some clang-format 11 features, plus a few cleanups and
  the usual update of the macro list"

* tag 'clang-format-for-linus-v5.19-rc1' of https://github.com/ojeda/linux:
  clang-format: Fix space after for_each macros
  clang-format: Fix goto labels indentation
  clang-format: Update to clang-format >= 6
  clang-format: Extend the for_each list with tools/
  clang-format: Simplify command with `sort -u`
  clang-format: Use POSIX locale for `sort`
  clang-format: Update with v5.18-rc7's `for_each` macro list

Merge tag 'v5.19-p1' of git://git./linux/kernel/git/herbert/crypto-2.6

Pull crypto updates from Herbert Xu:
"API:

   - Test in-place en/decryption with two sglists in testmgr

   - Fix process vs softirq race in cryptd

  Algorithms:

   - Add arm64 acceleration for sm4

   - Add s390 acceleration for chacha20

  Drivers:

   - Add polarfire soc hwrng support in mpsf

   - Add support for TI SoC AM62x in sa2ul

   - Add support for ATSHA204 cryptochip in atmel-sha204a

   - Add support for PRNG in caam

   - Restore support for storage encryption in qat

   - Restore support for storage encryption in hisilicon/sec"

* tag 'v5.19-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (116 commits)
  hwrng: omap3-rom - fix using wrong clk_disable() in omap_rom_rng_runtime_resume()
  crypto: hisilicon/sec - delete the flag CRYPTO_ALG_ALLOCATES_MEMORY
  crypto: qat - add support for 401xx devices
  crypto: qat - re-enable registration of algorithms
  crypto: qat - honor CRYPTO_TFM_REQ_MAY_SLEEP flag
  crypto: qat - add param check for DH
  crypto: qat - add param check for RSA
  crypto: qat - remove dma_free_coherent() for DH
  crypto: qat - remove dma_free_coherent() for RSA
  crypto: qat - fix memory leak in RSA
  crypto: qat - add backlog mechanism
  crypto: qat - refactor submission logic
  crypto: qat - use pre-allocated buffers in datapath
  crypto: qat - set to zero DH parameters before free
  crypto: s390 - add crypto library interface for ChaCha20
  crypto: talitos - Uniform coding style with defined variable
  crypto: octeontx2 - simplify the return expression of otx2_cpt_aead_cbc_aes_sha_setkey()
  crypto: cryptd - Protect per-CPU resource by disabling BH.
  crypto: sun8i-ce - do not fallback if cryptlen is less than sg length
  crypto: sun8i-ce - rework debugging
  ...

Merge tag '5.19-rc-smb3-client-fixes-updated' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs client updates from Steve French:

- multichannel fixes to improve reconnect after network failure

- improved caching of root directory contents (extending benefit of
   directory leases)

- two DFS fixes

- three fixes for improved debugging

- an NTLMSSP fix for mounts t0 older servers

- new mount parm to allow disabling creating sparse files

- various cleanup fixes and minor fixes pointed out by coverity

* tag '5.19-rc-smb3-client-fixes-updated' of git://git.samba.org/sfrench/cifs-2.6: (24 commits)
  smb3: remove unneeded null check in cifs_readdir
  cifs: fix ntlmssp on old servers
  cifs: cache the dirents for entries in a cached directory
  cifs: avoid parallel session setups on same channel
  cifs: use new enum for ses_status
  cifs: do not use tcpStatus after negotiate completes
  smb3: add mount parm nosparse
  smb3: don't set rc when used and unneeded in query_info_compound
  smb3: check for null tcon
  cifs: fix minor compile warning
  Add various fsctl structs
  Add defines for various newer FSCTLs
  smb3: add trace point for oplock not found
  cifs: return the more nuanced writeback error on close()
  smb3: add trace point for lease not found issue
  cifs: smbd: fix typo in comment
  cifs: set the CREATE_NOT_FILE when opening the directory in use_cached_dir()
  cifs: check for smb1 in open_cached_dir()
  cifs: move definition of cifs_fattr earlier in cifsglob.h
  cifs: print TIDs as hex
  ...

Merge tag 'jfs-5.19' of https://github.com/kleikamp/linux-shaggy

Pull jfs updates from David Kleikamp:
"One bug fix and some code cleanup"

* tag 'jfs-5.19' of https://github.com/kleikamp/linux-shaggy:
fs/jfs: Remove dead code
fs: jfs: fix possible NULL pointer dereference in dbFree()

Merge tag 'libnvdimm-for-5.19' of git://git./linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm and DAX updates from Dan Williams:
"New support for clearing memory errors when a file is in DAX mode,
  alongside with some other fixes and cleanups.

  Previously it was only possible to clear these errors using a truncate
  or hole-punch operation to trigger the filesystem to reallocate the
  block, now, any page aligned write can opportunistically clear errors
  as well.

  This change spans x86/mm, nvdimm, and fs/dax, and has received the
  appropriate sign-offs. Thanks to Jane for her work on this.

  Summary:

   - Add support for clearing memory error via pwrite(2) on DAX

   - Fix 'security overwrite' support in the presence of media errors

   - Miscellaneous cleanups and fixes for nfit_test (nvdimm unit tests)"

* tag 'libnvdimm-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  pmem: implement pmem_recovery_write()
  pmem: refactor pmem_clear_poison()
  dax: add .recovery_write dax_operation
  dax: introduce DAX_RECOVERY_WRITE dax access mode
  mce: fix set_mce_nospec to always unmap the whole page
  x86/mce: relocate set{clear}_mce_nospec() functions
  acpi/nfit: rely on mce->misc to determine poison granularity
  testing: nvdimm: asm/mce.h is not needed in nfit.c
  testing: nvdimm: iomap: make __nfit_test_ioremap a macro
  nvdimm: Allow overwrite in the presence of disabled dimms
  tools/testing/nvdimm: remove unneeded flush_workqueue

Merge branch 'next' into for-linus

Prepare input updates for 5.19 merge window.

Merge tag 'mfd-next-5.19' of git://git./linux/kernel/git/lee/mfd

Pull MFD updates from Lee Jones:
"New Device Support
   - Add support for {Power,Home} Keys to MediaTek MT6359
   - Add support for SC2730 to Spreadtrum SPRD SC27XX SPI
   - Add support for additional Alder Lake-P I2C Controllers to Intel
     LPSS PCI

  Fix-ups:
   - Convert GPIO to GPIOD (hi655x-pmic)
   - Only register devices that exist (cros_ec_dev)
   - Remove unused code (syscon, reg-mux)
   - Rework .remove() API to return void (twl-core, rt4831)
   - Trivial - whitespace, spelling, coding style (tps65218,
     sprd-sc27xx-spi, google,cros-ec)
   - DT binding changes (samsung,exynos5433-lpass, rockchip,rk805,
     rockchip,rk808, rockchip,rk809, rockchip,rk817, rockchip,rk818,
     wlf,arizona)

  Bug Fixes:
   - Fix error handling bugs (ipaq-micro, davinci_voicecodec)"

* tag 'mfd-next-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd:
  dt-bindings: cros-ec: Fix a typo in description
  dt-bindings: mfd: wlf,arizona: Add spi-max-frequency
  mfd: rt4831: Improve error reporting for problems during .remove()
  mfd: davinci_voicecodec: Fix possible null-ptr-deref davinci_vc_probe()
  mfd: intel-lpss: Add support for ADL-P i2c6 and i2c7
  dt-bindings: mfd: rk808: Convert bindings to yaml
  mfd: twl4030: Make twl4030_exit_irq() return void
  mfd: twl6030: Make twl6030_exit_irq() return void
  dt-bindings: mfd: samsung,exynos5433-lpass: Fix 'dma-channels/requests' properties
  mfd: sprd: Jugle {of,spi}_device_id tables into numerical order
  mfd: sprd: Add SC2730 PMIC to SPI device ID table
  dt-bindings: Drop undocumented i.MX iomuxc-gpr bindings in examples
  mfd: cros_ec_dev: Only register PCHG device if present
  mfd: mt6397-core: Add resources for PMIC keys for MT6359
  mfd: mt6359: Add missing defines necessary for mtk-pmic-keys support
  mfd: ipaq-micro: Fix error check return value of platform_get_irq()
  mfd: hi655x-pmic: Replace legacy gpio interface for gpiod interface
  mfd: tps65218: Fix trivial typo in comment

Merge tag 'clk-for-linus' of git://git./linux/kernel/git/clk/linux

Pull clk updates from Stephen Boyd:
"Mainly driver updates this time around.

  There's a single patch to the core clk framework that simplifies a
  runtime PM call. Otherwise the majority of the diff falls to a few SoC
  drivers: Qualcomm, STM32 and MediaTek. Those SoCs gain some new
  hardware support and what comes along with that is quite a few lines
  of data and some clk_ops code.

  Beyond the new hardware support we have the usual pile of driver
  updates that add missing clks on already supported SoCs or fix up
  problems like bad clk tree descriptions. It's nice to see that more
  drivers are moving to clk_hw based APIs too.

  New Drivers:
   - Add STM32MP13 RCC driver (Reset Clock Controller)
   - MediaTek MT8186 SoC clk support
   - Airoha EN7523 SoC system clocks
   - Clock driver for exynosautov9 SoC
   - Renesas R-Car V4H and RZ/V2M SoCs
   - Renesas RZ/G2UL SoC
   - LPASS clk driver for Qualcomm sc7280 SoC
   - GCC clk driver for Qualcomm SC8280XP SoC

  Updates:
   - SDCC uses floor clk ops on Qualcomm MSM8976
   - Add modem reset and fix RPM clks on Qualcomm MSM8976
   - Add the two missing CLKOUT clocks for U8500/DB8500 SoC
   - Mark some clks critical on Ingenic X1000
   - Convert ux500 to clk_hw
   - Move MediaTek driver to clk_hw provider APIs
   - Use i2c driver probe_new to avoid id scans
   - Convert a number of Rockchip dt bindings to YAML
   - Mark hclk_vo critical on Rockchip rk3568
   - Use pm_runtime_resume_and_get to fix pm_runtime_get_sync() usage
   - Various cleanups like memory allocation error checks and plugged
     leaks
   - Allwinner H6 RTC clock support
   - Allwinner H616 32 kHz clock support
   - Add the Universal Flash Storage clock on Renesas R-Car S4-8
   - Add I2C, SSIF-2 (sound), USB, CANFD, OSTM (timer), WDT, SPI Multi
     I/O Bus, RSPI, TSU (thermal), and ADC clocks and resets on Renesas
     RZ/G2UL
   - Add display clock support on Renesas RZ/G2L
   - Add RPC (QSPI/HyperFlash) clocks on Renesas R-Car E3 and D3
   - Add 27 MHz phy PLL ref clock on i.MX
   - Add mcore_booted module parameter to tell kernel M core has already
     booted for i.MX
   - Remove snvs clock on i.MX because it was for secure world only
   - Add dt bindings for i.MX8MN GPT
   - Add DISP2 pixel clock for i.MX8MP
   - Add clkout1/2 for i.MX8MP
   - Fix parent clock of ubs_root_clk for i.MX8MP
   - Implement better RCG parking on Qualcomm SoCs using the shared RCG
     clk ops
   - Kerneldoc fixes
   - Switch Tegra BPMP to determine_rate clk op
   - Add a pointer to dt schema for generic clock bindings"

* tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (168 commits)
  Revert "clk: qcom: regmap-mux: add pipe clk implementation"
  Revert "clk: qcom: gcc-sc7280: use new clk_regmap_mux_safe_ops for PCIe pipe clocks"
  Revert "clk: qcom: gcc-sm8450: use new clk_regmap_mux_safe_ops for PCIe pipe clocks"
  clk: bcm: rpi: Use correct order for the parameters of devm_kcalloc()
  clk: stm32mp13: add safe mux management
  clk: stm32mp13: add multi mux function
  clk: stm32mp13: add all STM32MP13 kernel clocks
  clk: stm32mp13: add all STM32MP13 peripheral clocks
  clk: stm32mp13: manage secured clocks
  clk: stm32mp13: add composite clock
  clk: stm32mp13: add stm32 divider clock
  clk: stm32mp13: add stm32_gate management
  clk: stm32mp13: add stm32_mux clock management
  clk: stm32: Introduce STM32MP13 RCC drivers (Reset Clock Controller)
  dt-bindings: rcc: stm32: add new compatible for STM32MP13 SoC
  clk: ti: clkctrl: replace usage of found with dedicated list iterator variable
  clk: ti: composite: Prefer kcalloc over open coded arithmetic
  dt-bindings: clock: exynosautov9: correct count of NR_CLK
  clk: mediatek: mt8173: Switch to clk_hw provider APIs
  clk: mediatek: Switch to clk_hw provider APIs
  ...

Merge tag 'pci-v5.19-changes' of git://git./linux/kernel/git/helgaas/pci

Pull pci updates from Bjorn Helgaas:
"Resource management:

   - Restrict E820 clipping to PCI host bridge windows (Bjorn Helgaas)

   - Log E820 clipping better (Bjorn Helgaas)

   - Add kernel cmdline options to enable/disable E820 clipping (Hans de
     Goede)

   - Disable E820 reserved region clipping for IdeaPads, Yoga, Yoga
     Slip, Acer Spin 5, Clevo Barebone systems where clipping leaves no
     usable address space for touchpads, Thunderbolt devices, etc (Hans
     de Goede)

   - Disable E820 clipping by default starting in 2023 (Hans de Goede)

  PCI device hotplug:

   - Include files to remove implicit dependencies (Christophe Leroy)

   - Only put Root Ports in D3 if they can signal and wake from D3 so
     AMD Yellow Carp doesn't miss hotplug events (Mario Limonciello)

  Power management:

   - Define pci_restore_standard_config() only for CONFIG_PM_SLEEP since
     it's unused otherwise (Krzysztof Kozlowski)

   - Power up devices completely, including anything platform firmware
     needs to do, during runtime resume (Rafael J. Wysocki)

   - Move pci_resume_bus() to PM callbacks so we observe the required
     bridge power-up delays (Rafael J. Wysocki)

   - Drop unneeded runtime_d3cold device flag (Rafael J. Wysocki)

   - Split pci_raw_set_power_state() between pci_power_up() and a new
     pci_set_low_power_state() (Rafael J. Wysocki)

   - Set current_state to D3cold if config read returns ~0, indicating
     the device is not accessible (Rafael J. Wysocki)

   - Do not call pci_update_current_state() from pci_power_up() so BARs
     and ASPM config are restored correctly (Rafael J. Wysocki)

   - Write 0 to PMCSR in pci_power_up() in all cases (Rafael J. Wysocki)

   - Split pci_power_up() to pci_set_full_power_state() to avoid some
     redundant operations (Rafael J. Wysocki)

   - Skip restoring BARs if device is not in D0 (Rafael J. Wysocki)

   - Rearrange and clarify pci_set_power_state() (Rafael J. Wysocki)

   - Remove redundant BAR restores from pci_pm_thaw_noirq() (Rafael J.
     Wysocki)

  Virtualization:

   - Acquire device lock before config space access lock to avoid AB/BA
     deadlock with sriov_numvfs_store() (Yicong Yang)

  Error handling:

   - Clear MULTI_ERR_COR/UNCOR_RCV bits, which a race could previously
     leave permanently set (Kuppuswamy Sathyanarayanan)

  Peer-to-peer DMA:

   - Whitelist Intel Skylake-E Root Ports regardless of which devfn they
     are (Shlomo Pongratz)

  ASPM:

   - Override L1 acceptable latency advertised by Intel DG2 so ASPM L1
     can be enabled (Mika Westerberg)

  Cadence PCIe controller driver:

   - Set up device-specific register to allow PTM Responder to be
     enabled by the normal architected bit (Christian Gmeiner)

   - Override advertised FLR support since the controller doesn't
     implement FLR correctly (Parshuram Thombare)

  Cadence PCIe endpoint driver:

   - Correct bitmap size for the ob_region_map of outbound window usage
     (Dan Carpenter)

  Freescale i.MX6 PCIe controller driver:

   - Fix PERST# assertion/deassertion so we observe the required delays
     before accessing device (Francesco Dolcini)

  Freescale Layerscape PCIe controller driver:

   - Add "big-endian" DT property (Hou Zhiqiang)

   - Update SCFG DT property (Hou Zhiqiang)

   - Add "aer", "pme", "intr" DT properties (Li Yang)

   - Add DT compatible strings for ls1028a (Xiaowei Bao)

  Intel VMD host bridge driver:

   - Assign VMD IRQ domain before enumeration to avoid IOMMU interrupt
     remapping errors when MSI-X remapping is disabled (Nirmal Patel)

   - Revert VMD workaround that kept MSI-X remapping enabled when IOMMU
     remapping was enabled (Nirmal Patel)

  Marvell MVEBU PCIe controller driver:

   - Add of_pci_get_slot_power_limit() to parse the
     'slot-power-limit-milliwatt' DT property (Pali Rohár)

   - Add mvebu support for sending Set_Slot_Power_Limit message (Pali
     Rohár)

  MediaTek PCIe controller driver:

   - Fix refcount leak in mtk_pcie_subsys_powerup() (Miaoqian Lin)

  MediaTek PCIe Gen3 controller driver:

   - Reset PHY and MAC at probe time (AngeloGioacchino Del Regno)

  Microchip PolarFlare PCIe controller driver:

   - Add chained_irq_enter()/chained_irq_exit() calls to mc_handle_msi()
     and mc_handle_intx() to avoid lost interrupts (Conor Dooley)

   - Fix interrupt handling race (Daire McNamara)

  NVIDIA Tegra194 PCIe controller driver:

   - Drop tegra194 MSI register save/restore, which is unnecessary since
     the DWC core does it (Jisheng Zhang)

  Qualcomm PCIe controller driver:

   - Add SM8150 SoC DT binding and support (Bhupesh Sharma)

   - Fix pipe clock imbalance (Johan Hovold)

   - Fix runtime PM imbalance on probe errors (Johan Hovold)

   - Fix PHY init imbalance on probe errors (Johan Hovold)

   - Convert DT binding to YAML (Dmitry Baryshkov)

   - Update DT binding to show that resets aren't required for
     MSM8996/APQ8096 platforms (Dmitry Baryshkov)

   - Add explicit register names per chipset in DT binding (Dmitry
     Baryshkov)

   - Add sc7280-specific clock and reset definitions to DT binding
     (Dmitry Baryshkov)

  Rockchip PCIe controller driver:

   - Fix bitmap size when searching for free outbound region (Dan
     Carpenter)

  Rockchip DesignWare PCIe controller driver:

   - Remove "snps,dw-pcie" from rockchip-dwc DT "compatible" property
     because it's not fully compatible with rockchip (Peter Geis)

   - Reset rockchip-dwc controller at probe (Peter Geis)

   - Add rockchip-dwc INTx support (Peter Geis)

  Synopsys DesignWare PCIe controller driver:

   - Return error instead of success if DMA mapping of MSI area fails
     (Jiantao Zhang)

  Miscellaneous:

   - Change pci_set_dma_mask() documentation references to
     dma_set_mask() (Alex Williamson)"

* tag 'pci-v5.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (64 commits)
  dt-bindings: PCI: qcom: Add schema for sc7280 chipset
  dt-bindings: PCI: qcom: Specify reg-names explicitly
  dt-bindings: PCI: qcom: Do not require resets on msm8996 platforms
  dt-bindings: PCI: qcom: Convert to YAML
  PCI: qcom: Fix unbalanced PHY init on probe errors
  PCI: qcom: Fix runtime PM imbalance on probe errors
  PCI: qcom: Fix pipe clock imbalance
  PCI: qcom: Add SM8150 SoC support
  dt-bindings: pci: qcom: Document PCIe bindings for SM8150 SoC
  x86/PCI: Disable E820 reserved region clipping starting in 2023
  x86/PCI: Disable E820 reserved region clipping via quirks
  x86/PCI: Add kernel cmdline options to use/ignore E820 reserved regions
  PCI: microchip: Fix potential race in interrupt handling
  PCI/AER: Clear MULTI_ERR_COR/UNCOR_RCV bits
  PCI: cadence: Clear FLR in device capabilities register
  PCI: cadence: Allow PTM Responder to be enabled
  PCI: vmd: Revert 2565e5b69c44 ("PCI: vmd: Do not disable MSI-X remapping if interrupt remapping is enabled by IOMMU.")
  PCI: vmd: Assign VMD IRQ domain before enumeration
  PCI: Avoid pci_dev_lock() AB/BA deadlock with sriov_numvfs_store()
  PCI: rockchip-dwc: Add legacy interrupt support
  ...

Merge tag 'mm-stable-2022-05-27' of git://git./linux/kernel/git/akpm/mm

Pull more MM updates from Andrew Morton:

- Two follow-on fixes for the post-5.19 series "Use pageblock_order for
   cma and alloc_contig_range alignment", from Zi Yan.

- A series of z3fold cleanups and fixes from Miaohe Lin.

- Some memcg selftests work from Michal Koutný <mkoutny@suse.com>

- Some swap fixes and cleanups from Miaohe Lin

- Several individual minor fixups

* tag 'mm-stable-2022-05-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (25 commits)
  mm/shmem.c: suppress shift warning
  mm: Kconfig: reorganize misplaced mm options
  mm: kasan: fix input of vmalloc_to_page()
  mm: fix is_pinnable_page against a cma page
  mm: filter out swapin error entry in shmem mapping
  mm/shmem: fix infinite loop when swap in shmem error at swapoff time
  mm/madvise: free hwpoison and swapin error entry in madvise_free_pte_range
  mm/swapfile: fix lost swap bits in unuse_pte()
  mm/swapfile: unuse_pte can map random data if swap read fails
  selftests: memcg: factor out common parts of memory.{low,min} tests
  selftests: memcg: remove protection from top level memcg
  selftests: memcg: adjust expected reclaim values of protected cgroups
  selftests: memcg: expect no low events in unprotected sibling
  selftests: memcg: fix compilation
  mm/z3fold: fix z3fold_page_migrate races with z3fold_map
  mm/z3fold: fix z3fold_reclaim_page races with z3fold_free
  mm/z3fold: always clear PAGE_CLAIMED under z3fold page lock
  mm/z3fold: put z3fold page back into unbuddied list when reclaim or migration fails
  revert "mm/z3fold.c: allow __GFP_HIGHMEM in z3fold_alloc"
  mm/z3fold: throw warning on failure of trylock_page in z3fold_alloc
  ...

Merge tag 'mm-hotfixes-stable-2022-05-27' of git://git./linux/kernel/git/akpm/mm

Pull hotfixes from Andrew Morton:
"Six hotfixes.

  The page_table_check one from Miaohe Lin is considered a minor thing
  so it isn't marked for -stable. The remainder address pre-5.19 issues
  and are cc:stable"

* tag 'mm-hotfixes-stable-2022-05-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/page_table_check: fix accessing unmapped ptep
  kexec_file: drop weak attribute from arch_kexec_apply_relocations[_add]
  mm/page_alloc: always attempt to allocate at least one page during bulk allocation
  hugetlb: fix huge_pmd_unshare address update
  zsmalloc: fix races between asynchronous zspage free and page migration
  Revert "mm/cma.c: remove redundant cma_mutex lock"

Merge tag 'mm-nonmm-stable-2022-05-26' of git://git./linux/kernel/git/akpm/mm

Pull misc updates from Andrew Morton:
"The non-MM patch queue for this merge window.

  Not a lot of material this cycle. Many singleton patches against
  various subsystems. Most notably some maintenance work in ocfs2
  and initramfs"

* tag 'mm-nonmm-stable-2022-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (65 commits)
  kcov: update pos before writing pc in trace function
  ocfs2: dlmfs: fix error handling of user_dlm_destroy_lock
  ocfs2: dlmfs: don't clear USER_LOCK_ATTACHED when destroying lock
  fs/ntfs: remove redundant variable idx
  fat: remove time truncations in vfat_create/vfat_mkdir
  fat: report creation time in statx
  fat: ignore ctime updates, and keep ctime identical to mtime in memory
  fat: split fat_truncate_time() into separate functions
  MAINTAINERS: add Muchun as a memcg reviewer
  proc/sysctl: make protected_* world readable
  ia64: mca: drop redundant spinlock initialization
  tty: fix deadlock caused by calling printk() under tty_port->lock
  relay: remove redundant assignment to pointer buf
  fs/ntfs3: validate BOOT sectors_per_clusters
  lib/string_helpers: fix not adding strarray to device's resource list
  kernel/crash_core.c: remove redundant check of ck_cmdline
  ELF, uapi: fixup ELF_ST_TYPE definition
  ipc/mqueue: use get_tree_nodev() in mqueue_get_tree()
  ipc: update semtimedop() to use hrtimer
  ipc/sem: remove redundant assignments
  ...

crypto: poly1305 - cleanup stray CRYPTO_LIB_POLY1305_RSIZE

When CRYPTO_LIB_POLY1305 is unset, CRYPTO_LIB_POLY1305_RSIZE
is still set in the Kconfig, cluttering things.

Fix this by making CRYPTO_LIB_POLY1305_RSIZE depend on
CRYPTO_LIB_POLY1305.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

arm64/hugetlb: Fix building errors in huge_ptep_clear_flush()

Fix the arm64 build error which was caused by commit ae07562909f3 ("mm:
change huge_ptep_clear_flush() to return the original pte") interacting
with commit fb396bb459c1 ("arm64/hugetlb: Drop TLB flush from
get_clear_flush()"):

  arch/arm64/mm/hugetlbpage.c: In function ‘huge_ptep_clear_flush’:
  arch/arm64/mm/hugetlbpage.c:515:9: error: implicit declaration of function ‘get_clear_flush’; did you mean ‘ptep_clear_flush’? [-Werror=implicit-function-declaration]
    515 |  return get_clear_flush(vma->vm_mm, addr, ptep, pgsize, ncontig);
        |         ^~~~~~~~~~~~~~~
        |         ptep_clear_flush

Due to the new get_clear_contig() has dropped TLB flush, we should add
an explicit TLB flush in huge_ptep_clear_flush() to keep original
semantics when changing to use new get_clear_contig().

Fixes: fb396bb459c1 ("arm64/hugetlb: Drop TLB flush from get_clear_flush()").
Fixes: ae07562909f3 ("mm: change huge_ptep_clear_flush() to return the original pte")
Reported-and-tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

pipe: Fix missing lock in pipe_resize_ring()

pipe_resize_ring() needs to take the pipe->rd_wait.lock spinlock to
prevent post_one_notification() from trying to insert into the ring
whilst the ring is being replaced.

The occupancy check must be done after the lock is taken, and the lock
must be taken after the new ring is allocated.

The bug can lead to an oops looking something like:

BUG: KASAN: use-after-free in post_one_notification.isra.0+0x62e/0x840
Read of size 4 at addr ffff88801cc72a70 by task poc/27196
...
Call Trace:
  post_one_notification.isra.0+0x62e/0x840
  __post_watch_notification+0x3b7/0x650
  key_create_or_update+0xb8b/0xd20
  __do_sys_add_key+0x175/0x340
  __x64_sys_add_key+0xbe/0x140
  do_syscall_64+0x5c/0xc0
  entry_SYSCALL_64_after_hwframe+0x44/0xae

Reported by Selim Enes Karaduman @Enesdex working with Trend Micro Zero
Day Initiative.

Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-17291
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

smb3: remove unneeded null check in cifs_readdir

Coverity pointed out an unneeded check.

Addresses-Coverity: 1518030 ("Null pointer dereferences")
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

mm/shmem.c: suppress shift warning

mm/shmem.c:1948 shmem_getpage_gfp() warn: should '(((1) << 12) / 512) << folio_order(folio)' be a 64 bit type?

On i386, so an unsigned long is 32-bit, but i_blocks is a 64-bit blkcnt_t.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Jessica Clarke <jrtc27@jrtc27.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: Kconfig: reorganize misplaced mm options

After commits 7b42f1041c98 ("mm: Kconfig: move swap and slab config
options to the MM section") and 519bcb797907 ("mm: Kconfig: group swap,
slab, hotplug and thp options into submenus") we now have nicely organized
mm related config options. I have noticed some that were still misplaced,
so this moves them from various places into the new structure:

VM_EVENT_COUNTERS, COMPAT_BRK, MMAP_ALLOW_UNINITIALIZED to mm/Kconfig and
general MM section.

SLUB_STATS to mm/Kconfig and the slab submenu.

DEBUG_SLAB, SLUB_DEBUG, SLUB_DEBUG_ON to mm/Kconfig.debug and the Kernel
hacking / Memory Debugging submenu.

Link: https://lkml.kernel.org/r/20220525112559.1139-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: kasan: fix input of vmalloc_to_page()

When print virtual mapping info for vmalloc address, it should pass
the addr not page, fix it.

Link: https://lkml.kernel.org/r/20220525120804.38155-1-wangkefeng.wang@huawei.com
Fixes: c056a364e954 ("kasan: print virtual mapping info in reports")
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: fix is_pinnable_page against a cma page

Pages in the CMA area could have MIGRATE_ISOLATE as well as MIGRATE_CMA so
the current is_pinnable_page() could miss CMA pages which have
MIGRATE_ISOLATE.  It ends up pinning CMA pages as longterm for the
pin_user_pages() API so CMA allocations keep failing until the pin is
released.

     CPU 0                                   CPU 1 - Task B

cma_alloc
alloc_contig_range
                                        pin_user_pages_fast(FOLL_LONGTERM)
change pageblock as MIGRATE_ISOLATE
                                        internal_get_user_pages_fast
                                        lockless_pages_from_mm
                                        gup_pte_range
                                        try_grab_folio
                                        is_pinnable_page
                                          return true;
                                        So, pinned the page successfully.
page migration failure with pinned page
                                        ..
                                        .. After 30 sec
                                        unpin_user_page(page)

CMA allocation succeeded after 30 sec.

The CMA allocation path protects the migration type change race using
zone->lock but what GUP path need to know is just whether the page is on
CMA area or not rather than exact migration type.  Thus, we don't need
zone->lock but just checks migration type in either of (MIGRATE_ISOLATE
and MIGRATE_CMA).

Adding the MIGRATE_ISOLATE check in is_pinnable_page could cause rejecting
of pinning pages on MIGRATE_ISOLATE pageblocks even though it's neither
CMA nor movable zone if the page is temporarily unmovable.  However, such
a migration failure by unexpected temporal refcount holding is general
issue, not only come from MIGRATE_ISOLATE and the MIGRATE_ISOLATE is also
transient state like other temporal elevated refcount problem.

Link: https://lkml.kernel.org/r/20220524171525.976723-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: filter out swapin error entry in shmem mapping

There might be swapin error entries in shmem mapping. Filter them out to
avoid "Bad swap file entry" complaint.

Link: https://lkml.kernel.org/r/20220519125030.21486-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/shmem: fix infinite loop when swap in shmem error at swapoff time

When swap in shmem error at swapoff time, there would be a infinite loop
in the while loop in shmem_unuse_inode().  It's because swapin error is
deliberately ignored now and thus info->swapped will never reach 0.  So we
can't escape the loop in shmem_unuse().

In order to fix the issue, swapin_error entry is stored in the mapping
when swapin error occurs.  So the swapcache page can be freed and the user
won't end up with a permanently mounted swap because a sector is bad.  If
the page is accessed later, the user process will be killed so that
corrupted data is never consumed.  On the other hand, if the page is never
accessed, the user won't even notice it.

Link: https://lkml.kernel.org/r/20220519125030.21486-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reported-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/madvise: free hwpoison and swapin error entry in madvise_free_pte_range

Once the MADV_FREE operation has succeeded, callers can expect they might
get zero-fill pages if accessing the memory again. Therefore it should be
safe to delete the hwpoison entry and swapin error entry. There is no
reason to kill the process if it has called MADV_FREE on the range.

Link: https://lkml.kernel.org/r/20220519125030.21486-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Suggested-by: Alistair Popple <apopple@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/swapfile: fix lost swap bits in unuse_pte()

This is observed by code review only but not any real report.

When we turn off swapping we could have lost the bits stored in the swap
ptes. The new rmap-exclusive bit is fine since that turned into a page
flag, but not for soft-dirty and uffd-wp. Add them.

Link: https://lkml.kernel.org/r/20220519125030.21486-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/swapfile: unuse_pte can map random data if swap read fails

Patch series "A few fixup patches for mm", v4.

This series contains a few patches to avoid mapping random data if swap
read fails and fix lost swap bits in unuse_pte.  Also we free hwpoison and
swapin error entry in madvise_free_pte_range and so on.  More details can
be found in the respective changelogs.

This patch (of 5):

There is a bug in unuse_pte(): when swap page happens to be unreadable,
page filled with random data is mapped into user address space.  In case
of error, a special swap entry indicating swap read fails is set to the
page table.  So the swapcache page can be freed and the user won't end up
with a permanently mounted swap because a sector is bad.  And if the page
is accessed later, the user process will be killed so that corrupted data
is never consumed.  On the other hand, if the page is never accessed, the
user won't even notice it.

Link: https://lkml.kernel.org/r/20220519125030.21486-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20220519125030.21486-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Howells <dhowells@redhat.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests: memcg: factor out common parts of memory.{low,min} tests

The memory protection test setup and runtime is almost equal for
memory.low and memory.min cases.

It makes modification of the common parts prone to mistakes, since the
protections are similar not only in setup but also in principle, factor
the common part out.

Past exceptions between the tests:
- missing memory.min is fine (kept),
- test_memcg_low protected orphaned pagecache (adapted like
test_memcg_min and we keep the processes of protected memory running).

The evaluation in two tests is different (OOM of allocator vs low events
of protégés), this is kept different.

Link: https://lkml.kernel.org/r/20220518161859.21565-6-mkoutny@suse.com
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
CC: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Richard Palethorpe <rpalethorpe@suse.de>
Cc: David Vernet <void@manifault.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests: memcg: remove protection from top level memcg

The reclaim is triggered by memory limit in a subtree, therefore the
testcase does not need configured protection against external reclaim.

Also, correct respective comments.

Link: https://lkml.kernel.org/r/20220518161859.21565-5-mkoutny@suse.com
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: David Vernet <void@manifault.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Richard Palethorpe <rpalethorpe@suse.de>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests: memcg: adjust expected reclaim values of protected cgroups

The numbers are not easy to derive in a closed form (certainly mere
protections ratios do not apply), therefore use a simulation to obtain
expected numbers.

Link: https://lkml.kernel.org/r/20220518161859.21565-4-mkoutny@suse.com
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: David Vernet <void@manifault.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Richard Palethorpe <rpalethorpe@suse.de>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests: memcg: expect no low events in unprotected sibling

This is effectively a revert of commit cdc69458a5f3 ("cgroup: account for
memory_recursiveprot in test_memcg_low()"). The case test_memcg_low will
fail with memory_recursiveprot until resolved in reclaim code.

However, this patch preserves the existing helpers and variables for later
uses.

Link: https://lkml.kernel.org/r/20220518161859.21565-3-mkoutny@suse.com
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: David Vernet <void@manifault.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Richard Palethorpe <rpalethorpe@suse.de>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests: memcg: fix compilation

Patch series "memcontrol selftests fixups", v2.

Flushing the patches to make memcontrol selftests check the events
behavior we had consensus about (test_memcg_low fails).

(test_memcg_reclaim, test_memcg_swap_max fail for me now but it's present
even before the refactoring.)

The two bigger changes are:
- adjustment of the protected values to make tests succeed with the given
  tolerance,
- both test_memcg_low and test_memcg_min check protection of memory in
  populated cgroups (actually as per Documentation/admin-guide/cgroup-v2.rst
  memory.min should not apply to empty cgroups, which is not the case
  currently. Therefore I unified tests with the populated case in order to to
  bring more broken tests).

This patch (of 5):

This fixes mis-applied changes from commit 72b1e03aa725 ("cgroup: account
for memory_localevents in test_memcg_oom_group_leaf_events()").

Link: https://lkml.kernel.org/r/20220518161859.21565-1-mkoutny@suse.com
Link: https://lkml.kernel.org/r/20220518161859.21565-2-mkoutny@suse.com
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: David Vernet <void@manifault.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Richard Palethorpe <rpalethorpe@suse.de>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/z3fold: fix z3fold_page_migrate races with z3fold_map

Think about the below scenario:

CPU1 CPU2
z3fold_page_migrate z3fold_map
  z3fold_page_trylock
  ...
  z3fold_page_unlock
  /* slots still points to old zhdr*/
get_z3fold_header
  get slots from handle
  get old zhdr from slots
  z3fold_page_trylock
  return *old* zhdr
  encode_handle(new_zhdr, FIRST|LAST|MIDDLE)
  put_page(page) /* zhdr is freed! */
but zhdr is still used by caller!

z3fold_map can map freed z3fold page and lead to use-after-free bug.  To
fix it, we add PAGE_MIGRATED to indicate z3fold page is migrated and soon
to be released.  So get_z3fold_header won't return such page.

Link: https://lkml.kernel.org/r/20220429064051.61552-10-linmiaohe@huawei.com
Fixes: 1f862989b04a ("mm/z3fold.c: support page migration")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/z3fold: fix z3fold_reclaim_page races with z3fold_free

Think about the below scenario:

CPU1 CPU2
z3fold_reclaim_page z3fold_free
spin_lock(&pool->lock) get_z3fold_header -- hold page_lock
kref_get_unless_zero
kref_put--zhdr->refcount can be 1 now
!z3fold_page_trylock
  kref_put -- zhdr->refcount is 0 now
   release_z3fold_page
    WARN_ON(!list_empty(&zhdr->buddy)); -- we're on buddy now!
    spin_lock(&pool->lock); -- deadlock here!

z3fold_reclaim_page might race with z3fold_free and will lead to pool lock
deadlock and zhdr buddy non-empty warning.  To fix this, defer getting the
refcount until page_lock is held just like what __z3fold_alloc does.  Note
this has the side effect that we won't break the reclaim if we meet a soon
to be released z3fold page now.

Link: https://lkml.kernel.org/r/20220429064051.61552-9-linmiaohe@huawei.com
Fixes: dcf5aedb24f8 ("z3fold: stricter locking and more careful reclaim")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/z3fold: always clear PAGE_CLAIMED under z3fold page lock

Think about the below race window:

CPU1 CPU2
z3fold_reclaim_page z3fold_free
test_and_set_bit PAGE_CLAIMED
failed to reclaim page
z3fold_page_lock(zhdr);
add back to the lru list;
z3fold_page_unlock(zhdr);
get_z3fold_header
page_claimed=test_and_set_bit PAGE_CLAIMED

clear_bit(PAGE_CLAIMED, &page->private);

if (!page_claimed) /* it's false true */
free_handle is not called

free_handle won't be called in this case. So z3fold_buddy_slots will leak.
Fix it by always clear PAGE_CLAIMED under z3fold page lock.

Link: https://lkml.kernel.org/r/20220429064051.61552-8-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>