review.tizen.org Git - platform/kernel/linux-rpi.git/log

Merge tag 'tegra-for-5.15-soc' of git://git./linux/kernel/git/tegra/linux into arm/drivers

soc/tegra: Changes for v5.15-rc1

Implements runtime PM support for the FUSE block and prepares the driver
to work better in conjunction with the CPUIDLE driver.

* tag 'tegra-for-5.15-soc' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux:
  soc/tegra: fuse: Enable fuse clock on suspend for Tegra124
  soc/tegra: fuse: Add runtime PM support
  soc/tegra: fuse: Clear fuse->clk on driver probe failure
  soc/tegra: pmc: Prevent racing with cpuilde driver
  soc/tegra: bpmp: Remove unused including <linux/version.h>

Link: https://lore.kernel.org/r/20210813162157.2820913-3-thierry.reding@gmail.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Merge tag 'tegra-for-5.15-firmware' of git://git./linux/kernel/git/tegra/linux into arm/drivers

firmware: tegra: Changes for v5.15-rc1

This contains a single fix to stop a slight abuse of the seq_buf API.

* tag 'tegra-for-5.15-firmware' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux:
firmware: tegra: Stop using seq_get_buf()

Link: https://lore.kernel.org/r/20210813162157.2820913-2-thierry.reding@gmail.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Merge tag 'scmi-updates-5.15' of git://git./linux/kernel/git/sudeep.holla/linux into arm/drivers

SCMI Updates for v5.15

The bulk of the addition this time is mainly refactoring to add support
for Virtio transport for SCMI and the addition of the support itself.
The refactoring includes allowing transport specific init/exit calls,
making each transport as compile time configurable, supporting
monotonically increasing tokens instead of using the next available
free buffer index as the token for scmi messages which eases handling
concurrent and out-of-order messages which is a must have for virtio
transport.

Virtio support itself is conformant to the virtio SCMI device spec [1].
Virtio device id 32 has been reserved for the SCMI device [2].

Other than the virtio support, there is one bug fix in the probe failure
clean up path.

[1] https://github.com/oasis-tcs/virtio-spec/blob/master/virtio-scmi.tex
[2] https://www.oasis-open.org/committees/ballot.php?id=3496

* tag 'scmi-updates-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
  firmware: arm_scmi: Use WARN_ON() to check configured transports
  firmware: arm_scmi: Fix boolconv.cocci warnings
  firmware: arm_scmi: Free mailbox channels if probe fails
  firmware: arm_scmi: Add virtio transport
  firmware: arm_scmi: Add priv parameter to scmi_rx_callback
  dt-bindings: arm: Add virtio transport for SCMI
  firmware: arm_scmi: Add optional link_supplier() transport op
  firmware: arm_scmi: Add message passing abstractions for transports
  firmware: arm_scmi: Add method to override max message number
  firmware: arm_scmi: Make shmem support optional for transports
  firmware: arm_scmi: Make SCMI transports configurable
  firmware: arm_scmi: Make polling mode optional
  firmware: arm_scmi: Make .clear_channel optional
  firmware: arm_scmi: Handle concurrent and out-of-order messages
  firmware: arm_scmi: Introduce monotonically increasing tokens
  firmware: arm_scmi: Add optional transport_init/exit support
  firmware: arm_scmi: Remove scmi_dump_header_dbg() helper
  firmware: arm_scmi: Add support for type handling in common functions

Link: https://lore.kernel.org/r/20210811075743.707961-1-sudeep.holla@arm.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Merge tag 'drivers_soc_for_5.15' of git://git./linux/kernel/git/ssantosh/linux-keystone into arm/drivers

soc: Keystone SOC drivers for v5.15

The pull request contains:
- ICSSG subsystem support for Keystone3 AM64x SOCs
- Removes smartrefelx PM dependency for deeper low power states

* tag 'drivers_soc_for_5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux-keystone:
  dt-bindings: soc: ti: pruss: Add dma-coherent property
  soc: ti: Remove pm_runtime_irq_safe() usage for smartreflex
  soc: ti: pruss: Enable support for ICSSG subsystems on K3 AM64x SoCs
  dt-bindings: soc: ti: pruss: Update bindings for K3 AM64x SoCs

Link: https://lore.kernel.org/r/0A637A41-2353-4900-962C-DBE50BBDE75A@oracle.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Merge tag 'memory-controller-drv-5.15' of git://git./linux/kernel/git/krzk/linux-mem-ctrl into arm/drivers

Memory controller drivers for v5.15

Few minor fixes: maintainer pattern, unused-function warning in
Tegra186 and suspend/resume in OMAP GPMC.

* tag 'memory-controller-drv-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-mem-ctrl:
  memory: omap-gpmc: Drop custom PM calls with cpu_pm notifier
  memory: omap-gpmc: Clear GPMC_CS_CONFIG7 register on restore if unused
  MAINTAINERS: update arm,pl353-smc.yaml reference
  memory: tegra: fix unused-function warning

Link: https://lore.kernel.org/r/20210809152639.110576-1-krzysztof.kozlowski@canonical.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Merge tag 'v5.14-next-soc' of git://git./linux/kernel/git/matthias.bgg/linux into arm/drivers

pm-domains:
- correct mask define if used for update on TOPAXI bus
- mt8173: enable regulator befor turning on MFG_ASYNC

mmsys:
- add a mask property to the routing information
- add support for MT8365
- add UFOE routing for MT8173

* tag 'v5.14-next-soc' of git://git.kernel.org/pub/scm/linux/kernel/git/matthias.bgg/linux:
  soc: mediatek: mmsys: Fix missing UFOE component in mt8173 table routing
  soc: mediatek: mmsys: add MT8365 support
  soc: mmsys: mediatek: add mask to mmsys routes
  soc: mediatek: pm-domains: Add domain_supply cap for mfg_async PD
  soc: mediatek: pm-domains: Use correct mask for bus_prot_clr

Link: https://lore.kernel.org/r/49fc7bef-20db-b98c-9437-dd9e4d00e870@gmail.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Merge tag 'imx-ecspi-5.15' of git://git./linux/kernel/git/shawnguo/linux into arm/drivers

i.MX eCSPI errata handling for 5.15:

It includes all required changes for handling i.MX6/7 eCSPI errata
ERR009165, which causes FIFO transfer to be sent twice in DMA mode.
Both SPI and DMA maintainers agree to merge it through arm-soc tree.

* tag 'imx-ecspi-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux:
  dmaengine: imx-sdma: add terminated list for freed descriptor in worker
  dmaengine: imx-sdma: add uart rom script
  dma: imx-sdma: add i.mx6ul compatible name
  dmaengine: imx-sdma: remove ERR009165 on i.mx6ul
  spi: imx: remove ERR009165 workaround on i.mx6ul
  spi: imx: fix ERR009165
  dmaengine: imx-sdma: add mcu_2_ecspi script
  dmaengine: dma: imx-sdma: add fw_loaded and is_ram_script
  dmaengine: imx-sdma: remove duplicated sdma_load_context
  Revert "dmaengine: imx-sdma: refine to load context only once"
  Revert "ARM: dts: imx6: Use correct SDMA script for SPI cores"
  Revert "ARM: dts: imx6q: Use correct SDMA script for SPI5 core"

Link: https://lore.kernel.org/r/20210809071838.GF30984@dragon
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

bus: ixp4xx: return on error in ixp4xx_exp_probe()

This code was intended to return an error code if regmap_read() fails
but the return statement was missing.

Fixes: 1c953bda90ca ("bus: ixp4xx: Add a driver for IXP4xx expansion bus")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/20210807230016.3607666-1-linus.walleij@linaro.org'
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Merge tag 'omap-for-v5.15/ti-sysc-signed' of git://git./linux/kernel/git/tmlind/linux-omap into arm/drivers

Driver changes for ti-sysc for v5.15

Few ti-sysc changes to handle quirk for McASP SIDLE mode, correct
documentation for sysc_ioremap(), and start using
pm_runtime_resume_and_get().

* tag 'omap-for-v5.15/ti-sysc-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
  bus: ti-sysc: Add quirk for OMAP4 McASP to disable SIDLE mode
  bus: ti-sysc: using pm_runtime_resume_and_get instead of pm_runtime_get_sync
  bus: ti-sysc: Correct misdocumentation of 'sysc_ioremap()'
  bus: ti-sysc: Fix gpt12 system timer issue with reserved status

Link: https://lore.kernel.org/r/pull-1628153040-834155@atomide.com-2
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

firmware: tegra: Stop using seq_get_buf()

Opencode a copy of mrq_debug_read() in bpmp_debug_show() so that it
can use seq_write() directly instead of poking holes into the seq_file
abstractions using seq_get_buf().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Thierry Reding <treding@nvidia.com>

soc/tegra: fuse: Enable fuse clock on suspend for Tegra124

The FUSE clock should be enabled during suspend on Tegra124. Currently
clk driver enables it on all SoCs, but FUSE may require a higher core
voltage on Tegra30 while enabled. Move the quirk into the FUSE driver
and make it specific to Tegra124.

Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>

soc/tegra: fuse: Add runtime PM support

The Tegra FUSE belongs to the core power domain and we're going to enable
GENPD support for the core domain. Now FUSE device must be resumed using
runtime PM API in order to initialize the FUSE power state. Add runtime PM
support to the FUSE driver.

Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>

soc/tegra: fuse: Clear fuse->clk on driver probe failure

The fuse->clk must be cleared if FUSE driver fails to probe, otherwise
tegra_fuse_readl() will crash. It's unlikely to happen in practice,
nevertheless let's correct it for completeness.

Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>

soc/tegra: pmc: Prevent racing with cpuilde driver

Both PMC and cpuidle drivers are probed at the same init level and
cpuidle depends on the PMC suspend mode. Add new default suspend mode
that indicates whether PMC driver has been probed and reset the mode in
a case of deferred probe of the PMC driver.

Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>

soc/tegra: bpmp: Remove unused including <linux/version.h>

Remove including <linux/version.h> that don't need it.

V1->V2: Split the patch in two

Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thierry Reding <treding@nvidia.com>

dt-bindings: soc: ti: pruss: Add dma-coherent property

Update the PRUSS schema file to include the dma-coherent property
that indicates the coherency of the IP. The PRUSS IPs on 66AK2G
SoCs do use this property.

The new added dma-coherent property is a required property _only_
for 66AK2G SoCs and is not required/applicable for other SoCs, so
the binding is backward compatible for other SoCs. This update is
being done before the corresponding dts nodes can be added for 66AK2G
SoCs.

Signed-off-by: Suman Anna <s-anna@ti.com>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

soc: ti: Remove pm_runtime_irq_safe() usage for smartreflex

For the smartreflex device, we need to disable smartreflex on SoC idle,
and have been using pm_runtime_irq_safe() to do that. But we want to
remove the irq_safe usage as PM runtime takes a permanent usage count
on the parent device with it.

In order to remove the need for pm_runtime_irq_safe(), let's gate
the clock directly in the driver. This removes the need to call PM runtime
during idle, and allows us to switch to using CPU_PM in the following
patch.

Note that the smartreflex interconnect target module is configured for smart
idle, but the clock does not have autoidle capability, and needs to be gated
manually. If the clock supported autoidle, we would not need to even gate
the clock.

With this change, we can now remove the related quirk flags for ti-sysc
also.

Signed-off-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

soc: ti: pruss: Enable support for ICSSG subsystems on K3 AM64x SoCs

The K3 AM64x family of SoCs have a similar version of the PRU-ICSS (ICSSG)
processor subsystem present on K3 J721E and K3 AM65x SR2.0 SoCs. These SoCs
contain typically two ICSSG instances named ICSSG0 and ICSSG1. The two
ICSSGs are identical to each other for the most part with minor SoC
integration differences and capabilities. SGMII mode is not supported at
all on these SoCs (unlike specific instances on AM65x, J721E). The ICSSG1
also has limited pins connected on some sub-modules compared to ICSSG0.

There is no change in the Interrupt Controller w.r.t either of AM65x or
J721E SoCs. All other integration aspects are also very similar to the
existing SoCs.

The existing pruss platform driver has been updated to support these
similar ICSSG instances through a new AM64x specific compatible.

Signed-off-by: Suman Anna <s-anna@ti.com>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

dt-bindings: soc: ti: pruss: Update bindings for K3 AM64x SoCs

The K3 AM64x SoCs also have the Gigabit Ethernet capable PRU-ICSS IP
that is present on existing K3 AM65x and J721E SoCs (ICSSG). The IP
is similar to the ones used on K3 J721E or AM65x SR2.0 SoCs.

Update the PRUSS bindings for these ICSSG instances.

Signed-off-by: Suman Anna <s-anna@ti.com>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

firmware: arm_scmi: Use WARN_ON() to check configured transports

Use a WARN_ON() when SCMI stack is loaded to check the consistency of
configured SCMI transports instead of the current compile-time check
BUILD_BUG_ON() to avoid breaking bot-builds on random bad configs.

Bail-out early and noisy during SCMI stack initialization if no transport
was enabled in configuration since SCMI cannot work without at least one
enabled transport and such constraint cannot be enforced in Kconfig due to
circular dependency issues.

Link: https://lore.kernel.org/r/20210809092245.8730-1-cristian.marussi@arm.com
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>

firmware: arm_scmi: Fix boolconv.cocci warnings

drivers/firmware/arm_scmi/virtio.c:225:40-45: WARNING: conversion to bool not needed here

Remove unneeded conversion to bool

Semantic patch information:
Relational and logical operators evaluate to bool,
explicit conversion is overly verbose and unneeded.

Generated by: scripts/coccinelle/misc/boolconv.cocci

Link: https://lore.kernel.org/r/20210807173127.GA43248@a24dbc127934
CC: Igor Skalkin <igor.skalkin@opensynergy.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>

soc: mediatek: mmsys: Fix missing UFOE component in mt8173 table routing

The UFOE (data compression engine) component needs to be enabled to have
the imgtec gpu driver working. If we don't enable it we see a black screen.
Looks like when we switched to use and array for setting the routing
registers in commit 440147639ac7 ("soc: mediatek: mmsys: Use an array for
setting the routing registers") we missed to add this component in the new
routing table, it was present before that commit, so fix it by adding
this component in the mt8173 routing table.

Fixes: 440147639ac7 ("soc: mediatek: mmsys: Use an array for setting the routing registers")
Signed-off-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
Tested-by: Eizan Miyamoto <eizan@chromium.org>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20210625062448.3462177-1-enric.balletbo@collabora.com
[mb: taking into account mask value]
Signed-off-by: Matthias Brugger <matthias.bgg@gmail.com>

soc: mediatek: mmsys: add MT8365 support

Add DSI mmsys connections for the MT8365 SoC.

Signed-off-by: Fabien Parent <fparent@baylibre.com>
Link: https://lore.kernel.org/r/20210519161847.3747352-3-fparent@baylibre.com
[mb: take the mask field into account]
Signed-off-by: Matthias Brugger <matthias.bgg@gmail.com>

firmware: arm_scmi: Free mailbox channels if probe fails

Mailbox channels for the base protocol are setup during probe.
There can be a scenario where probe fails to acquire the base
protocol due to a timeout leading to cleaning up of all device
managed memory including the scmi_mailbox structure setup during
mailbox_chan_setup function.

    | arm-scmi soc:qcom,scmi: timed out in resp(caller: version_get+0x84/0x140)
    | arm-scmi soc:qcom,scmi: unable to communicate with SCMI
    | arm-scmi: probe of soc:qcom,scmi failed with error -110

Now when a message arrives at cpu slightly after the timeout, the mailbox
controller will try to call the rx_callback of the client and might end
up accessing freed memory.

     | rx_callback+0x24/0x160
     | mbox_chan_received_data+0x44/0x94
     | __handle_irq_event_percpu+0xd4/0x240

This patch frees the mailbox channels setup during probe and adds some more
error handling in case the probe fails.

Link: https://lore.kernel.org/r/1628111999-21595-1-git-send-email-rishabhb@codeaurora.org
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Reviewed-by: Cristian Marussi <cristian.marussi@arm.com>
Signed-off-by: Rishabh Bhatnagar <rishabhb@codeaurora.org>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>

firmware: arm_scmi: Add virtio transport

This transport enables communications with an SCMI platform through virtio;
the SCMI platform will be represented by a virtio device.

Implement an SCMI virtio driver according to the virtio SCMI device spec
[1]. Virtio device id 32 has been reserved for the SCMI device [2].

The virtio transport has one Tx channel (virtio cmdq, A2P channel) and
at most one Rx channel (virtio eventq, P2A channel).

The following feature bit defined in [1] is not implemented:
VIRTIO_SCMI_F_SHARED_MEMORY.

The number of messages which can be pending simultaneously is restricted
according to the virtqueue capacity negotiated at probing time.

As soon as Rx channel message buffers are allocated or have been read
out by the arm-scmi driver, feed them back to the virtio device.

Since some virtio devices may not have the short response time exhibited
by SCMI platforms using other transports, set a generous response
timeout.

SCMI polling mode is not supported by this virtio transport since deemed
meaningless: polling mode operation is offered by the SCMI core to those
transports that could not provide a completion interrupt on the TX path,
which is never the case for virtio whose core callbacks can easily call
into core scmi_rx_callback upon messages reception.

[1] https://github.com/oasis-tcs/virtio-spec/blob/master/virtio-scmi.tex
[2] https://www.oasis-open.org/committees/ballot.php?id=3496

Link: https://lore.kernel.org/r/20210803131024.40280-16-cristian.marussi@arm.com
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Co-developed-by: Peter Hilber <peter.hilber@opensynergy.com>
Co-developed-by: Cristian Marussi <cristian.marussi@arm.com>
Signed-off-by: Igor Skalkin <igor.skalkin@opensynergy.com>
[ Peter: Adapted patch for submission to upstream. ]
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
[ Cristian: simplified driver logic, changed link_supplier and channel
available/setup logic, removed dummy callbacks ]
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>

firmware: arm_scmi: Add priv parameter to scmi_rx_callback

Add a new opaque void *priv parameter to scmi_rx_callback which can be
optionally provided by the transport layer when invoking scmi_rx_callback
and that will be passed back to the transport layer in xfer->priv.

This can be used by transports that needs to keep track of their specific
data structures together with the valid xfers.

Link: https://lore.kernel.org/r/20210803131024.40280-15-cristian.marussi@arm.com
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>

dt-bindings: arm: Add virtio transport for SCMI

Document the properties for arm,scmi-virtio compatible nodes.
The backing virtio SCMI device is described in patch [1].

While doing that, make shmem property required only for pre-existing
mailbox and smc transports, since virtio-scmi does not need it.

[1] https://lists.oasis-open.org/archives/virtio-comment/202102/msg00018.html

Link: https://lore.kernel.org/r/20210803131024.40280-14-cristian.marussi@arm.com
Co-developed-by: Peter Hilber <peter.hilber@opensynergy.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Igor Skalkin <igor.skalkin@opensynergy.com>
[ Peter: Adapted patch for submission to upstream. ]
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
[ Cristian: converted to yaml format, moved shmen required property. ]
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>

firmware: arm_scmi: Add optional link_supplier() transport op

Some transports are also effectively registered with other kernel subsystem
in order to be properly probed and initialized; as a consequence such kind
of transports, and their related devices, might still not have been probed
and initialized at the time the main SCMI core driver is probed.

Add an optional .link_supplier() transport operation which can be used by
the core SCMI stack to dynamically check if the transport is ready and
dynamically link its device to the SCMI platform instance device.

Link: https://lore.kernel.org/r/20210803131024.40280-13-cristian.marussi@arm.com
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
[ Cristian: reworded commit message ]
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>

firmware: arm_scmi: Add message passing abstractions for transports

Add abstractions for future transports using message passing, such as
virtio. Derive the abstractions from the shared memory abstractions.

Abstract the transport SDU through the opaque struct scmi_msg_payld.
Also enable the transport to determine all other required information
about the transport SDU.

Link: https://lore.kernel.org/r/20210803131024.40280-12-cristian.marussi@arm.com
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
[ Cristian: Adapted to new SCMI Kconfig layout, updated Copyrights ]
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>

firmware: arm_scmi: Add method to override max message number

The maximum number of simultaneously pending messages is a transport
specific quantity that is usually described statically in struct scmi_desc.

Some transports, though, can calculate such number only at run-time after
some initial transport specific setup and probing is completed; moreover
the resulting max message numbers could also be different between rx and
tx channels.

Add an optional get_max_msg() operation so that a transport can report more
accurate max message numbers for each channel type.

The value in scmi_desc.max_msg is still used as default when transport does
not provide any get_max_msg() method.

Link: https://lore.kernel.org/r/20210803131024.40280-11-cristian.marussi@arm.com
Co-developed-by: Peter Hilber <peter.hilber@opensynergy.com>
Co-developed-by: Cristian Marussi <cristian.marussi@arm.com>
Signed-off-by: Igor Skalkin <igor.skalkin@opensynergy.com>
[ Peter: Adapted patch for submission to upstream. ]
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
[ Cristian: refactored how get_max_msg() is used to minimize core changes ]
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>

firmware: arm_scmi: Make shmem support optional for transports

Upcoming new SCMI transports won't need any kind of shared memory support.
Compile shmem.c only if a shmem based transport is selected.

Link: https://lore.kernel.org/r/20210803131024.40280-10-cristian.marussi@arm.com
Co-developed-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Igor Skalkin <igor.skalkin@opensynergy.com>
[ Peter: Adapted patch for submission to upstream. ]
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
[ Cristian: Adapted patch/commit_msg to new SCMI Kconfig layout ]
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>

firmware: arm_scmi: Make SCMI transports configurable

Add configuration options to be able to select which SCMI transports have
to be compiled into the SCMI stack.

Mailbox and SMC are by default enabled if their related dependencies are
satisfied.

While doing that move all SCMI related config options in their own
dedicated submenu.

Link: https://lore.kernel.org/r/20210803131024.40280-9-cristian.marussi@arm.com
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>

firmware: arm_scmi: Make polling mode optional

Add a check for the presence of .poll_done transport operation so that
transports that do not need to support polling mode have no need to provide
a dummy .poll_done callback either and polling mode can be disabled in the
SCMI core for that tranport.

Link: https://lore.kernel.org/r/20210803131024.40280-8-cristian.marussi@arm.com
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>

firmware: arm_scmi: Make .clear_channel optional

Make transport operation .clear_channel optional since some transports
do not need it and so avoid to have them implement dummy callbacks.

Link: https://lore.kernel.org/r/20210803131024.40280-7-cristian.marussi@arm.com
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>

firmware: arm_scmi: Handle concurrent and out-of-order messages

Even though in case of asynchronous commands an SCMI platform is
constrained to emit the delayed response message only after the related
message response has been sent, the configured underlying transport could
still deliver such messages together or in inverted order, causing races
due to the concurrent or out-of-order access to the underlying xfer.

Introduce a mechanism to grant exclusive access to an xfer in order to
properly serialize concurrent accesses to the same xfer originating from
multiple correlated messages.

Add additional state information to xfer descriptors so as to be able to
identify out-of-order message deliveries and act accordingly:

- when a delayed response is expected but delivered before the related
   response, the synchronous response is considered as successfully
   received and the delayed response processing is carried on as usual.

- when/if the missing synchronous response is subsequently received, it
   is discarded as not congruent with the current state of the xfer, or
   simply, because the xfer has been already released and so, now, the
   monotonically increasing sequence number carried by the late response
   is stale.

Link: https://lore.kernel.org/r/20210803131024.40280-6-cristian.marussi@arm.com
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>

firmware: arm_scmi: Introduce monotonically increasing tokens

Tokens are sequence numbers embedded in the each SCMI message header: they
are used to correlate commands with responses (and delayed responses), but
their usage and policy of selection is entirely up to the caller (usually
the OSPM agent), while they are completely opaque to the callee (i.e. SCMI
platform) which merely copies them back from the command into the response
message header.
This also means that the platform does not, can not and should not enforce
any kind of policy on received messages depending on the contained sequence
number: platform can perfectly handle concurrent requests carrying the same
identifiying token if that should happen.

Moreover the platform is not required to produce in-order responses to
agent requests, the only constraint in these regards is that in case of
an asynchronous message the delayed response must be sent after the
immediate response for the synchronous part of the command transaction.

Currenly the SCMI stack of the OSPM agent selects a token for the egressing
commands picking the lowest possible number which is not already in use by
an existing in-flight transaction, which means, in other words, that we
immediately reuse any token after its transaction has completed or it has
timed out: this policy indeed does simplify management and lookup of tokens
and associated xfers.

Under the above assumptions and constraints, since there is really no state
shared between the agent and the platform to let the platform know when a
token and its associated message has timed out, the current policy of early
reuse of tokens can easily lead to the situation in which a spurious or
late received response (or delayed_response), related to an old stale and
timed out transaction, can be wrongly associated to a newer valid in-flight
xfer that just happens to have reused the same token.

This misbehaviour on such late/spurious responses is more easily exposed on
those transports that naturally have an higher level of parallelism in
processing multiple concurrent in-flight messages.

This commit introduces a new policy of selection of tokens for the OSPM
agent: each new command transfer now gets the next available, monotonically
increasing token, until tokens are exhausted and the counter rolls over.

Such new policy mitigates the above issues with late/spurious responses
since the tokens are now reused as late as possible (when they roll back
ideally) and so it is much easier to identify such late/spurious responses
to stale timed out transactions: this also helps in simplifying the
specific transports implementation since stale transport messages can be
easily identified and discarded early on in the rx path without the need
to cross check their actual state with the core transport layer.
This mitigation is even more effective when, as is usually the case, the
maximum number of pending messages is capped by the platform to a much
lower number than the whole possible range of tokens values (2^10).

This internal policy change in the core SCMI transport layer is fully
transparent to the specific transports so it has not and should not have
any impact on the transports implementation.

Link: https://lore.kernel.org/r/20210803131024.40280-5-cristian.marussi@arm.com
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>

firmware: arm_scmi: Add optional transport_init/exit support

Some SCMI transport could need to perform some transport specific setup
before they can be used by the SCMI core transport layer: typically this
early setup consists in registering with some other kernel subsystem.

Add the optional capability for a transport to provide a couple of init
and exit functions that are assured to be called early during the SCMI
core initialization phase, well before the SCMI core probing step.

[ Peter: Adapted RFC patch by Cristian for submission to upstream. ]

Link: https://lore.kernel.org/r/20210803131024.40280-4-cristian.marussi@arm.com
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
[ Cristian: Fixed scmi_transports_exit point of invocation ]
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>

firmware: arm_scmi: Remove scmi_dump_header_dbg() helper

Being a while that we have SCMI trace events in the SCMI stack, remove
this debug helper and its call sites.

Link: https://lore.kernel.org/r/20210803131024.40280-3-cristian.marussi@arm.com
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>

firmware: arm_scmi: Add support for type handling in common functions

Add SCMI type handling to pack/unpack_scmi_header common helper functions.
Initialize hdr.type properly when initializing a command xfer.

Link: https://lore.kernel.org/r/20210803131024.40280-2-cristian.marussi@arm.com
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>

soc: mmsys: mediatek: add mask to mmsys routes

SOUT has many bits and need to be cleared before set new value.
Write only could do the clear, but for MOUT, it clears bits that
should not be cleared. So use a mask to reset only the needed bits.

this fixes HDMI issues on MT7623/BPI-R2 since 5.13

Fixes: 440147639ac7 ("soc: mediatek: mmsys: Use an array for setting the routing registers")
Signed-off-by: Frank Wunderlich <frank-w@public-files.de>
Signed-off-by: CK Hu <ck.hu@mediatek.com>
Reviewed-by: Chun-Kuang Hu <chunkuang.hu@kernel.org>
Reviewed-by: Hsin-Yi Wang <hsinyi@chromium.org>
Link: https://lore.kernel.org/r/20210729070549.5514-1-linux@fw-web.de
Signed-off-by: Matthias Brugger <matthias.bgg@gmail.com>

Merge tag 'ixp4xx-drivers-arm-soc-v5.15-1' of git://git./linux/kernel/git/linusw/linux-nomadik into arm/drivers

IXP4xx driver updates for modernizing the IXP4xx platforms,
taregeted for v5.15:

- Add DT bindings to the expansion bus and PATA libata driver.

- Add a new expansion bus driver.

- Rewrite the watchdog driver to use the watchdog core and
  spawn from the timer (clocksource) driver.

- Refactor the PATA/libata driver to probe from the device
  tree and use the expansion bus driver to manipulate chip
  select timings directly.

* tag 'ixp4xx-drivers-arm-soc-v5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-nomadik:
  pata: ixp4xx: Rewrite to use device tree
  pata: ixp4xx: Add DT bindings
  pata: ixp4xx: Refer to cmd and ctl rather than csN
  pata: ixp4xx: Use IS_ENABLED() to determine endianness
  pata: ixp4xx: Use local dev variable
  watchdog: ixp4xx: Rewrite driver to use core
  bus: ixp4xx: Add a driver for IXP4xx expansion bus
  bus: ixp4xx: Add DT bindings for the IXP4xx expansion bus

Link: https://lore.kernel.org/r/CACRpkdZaCosXsgp02nuUbd_nEvdxm5-z0+d0oSA97UTWQ0RQQg@mail.gmail.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

pata: ixp4xx: Rewrite to use device tree

This rewrites the IXP4xx CF (IDE) libata driver to use the
device tree exclusively to look up its resources:

- Probe exclusively from the device tree and look up all
  resources from there.
- Allocate a local state container with devres and pass
  this around in .private_data.
- Initialize with struct ata_port_info.
- Use the .set_piomode() callback instead of the much
  wider .set_mode(), we only support PIO after all.
- Bump driver version number from 0.2 to 1.0 to reflect this
  wider change.
- Get a handle on the expansion bus syscon regmap to alter
  the timings on the chip select.
- Put in the more elaborate timing adjustment code for PIO0
  to PIO4 in 8 and 16bit mode from the downstream OpenWrt
  patch.

The board file initialization path and platform data include
is dropped because the board files will be deleted at the same
time as this patch is merged.

The platform data file is not deleted right now so as not to
conflict with the removal of board files.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

pata: ixp4xx: Add DT bindings

This adds device tree bindings for the Intel IXP4xx compact flash card
interface.

Cc: devicetree@vger.kernel.org
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

pata: ixp4xx: Refer to cmd and ctl rather than csN

The two "cs0" and "cs1" are "chip selects" but on some
platforms such as GW2358 they are actually both in CS3
making this terminology very confusing. Call the
addresses "cmd" and "ctl" after function instead.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

pata: ixp4xx: Use IS_ENABLED() to determine endianness

Instead of an ARM-specific ifdef, use the global CPU config
and if (IS_ENABLED()).

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

pata: ixp4xx: Use local dev variable

Let's simplify all &pdev->dev references by creating a
local struct device *dev variable.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

watchdog: ixp4xx: Rewrite driver to use core

This rewrites the IXP4xx watchdog driver as follows:

- Spawn the watchdog driver as a platform device from the timer
  driver. It's one device in the hardware, and the fact that
  Linux splits the handling into two different devices is
  a Linux pecularity, and thus it becomes a Linux pecularity
  to spawn a separate watchdog driver.

- Spawn the watchdog driver from the timer driver at probe().
  This is well after the timer driver as actually registered and
  started and we know the register base is available.

- Instead of looping back callbacks to the timer drivers for all
  watchdog calls, pass the register base to the watchdog driver
  and manage the registers there. The two drivers aren't even
  interested in the same register so the spinlock is totally
  surplus, delete it.

- Replace pretty much all of the content in the watchdog driver
  with a simple, modern watchdog driver utilizing the watchdog
  core instead of registering its own misc device and ioctl()
  handling.

- Drop module parameters as the same already exist in the
  watchdog core.

What remains is a slim elegant (IMO) watchdog driver using the
watchdog core, spawning from device tree or boardfile alike.

Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

bus: ixp4xx: Add a driver for IXP4xx expansion bus

The Intel IXP4xx SoCs have an expansion bus that is usually just
used for flash memory and configured by the boot loaders and can
be accessed using the "simple-bus".

However some devices need more elaborate configuration and then we
need to provide a proper 3-unit address space indicating chip
select for each device and provide timing and similar information.

Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

bus: ixp4xx: Add DT bindings for the IXP4xx expansion bus

This adds device tree bindings for the IXP4xx expansion bus controller.

Cc: Marc Zyngier <maz@kernel.org>
Cc: devicetree@vger.kernel.org
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

Merge tag 'renesas-drivers-for-v5.15-tag1' of git://git./linux/kernel/git/geert/renesas-devel into arm/drivers

Renesas driver updates for v5.15

- Initial support for the new R-Car H3e-2G and M3e-2G SoCs.

* tag 'renesas-drivers-for-v5.15-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-devel:
soc: renesas: Identify R-Car H3e-2G and M3e-2G

Link: https://lore.kernel.org/r/cover.1627650704.git.geert+renesas@glider.be
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Linux 5.14-rc4

Merge tag 'perf-tools-fixes-for-v5.14-2021-08-01' of git://git./linux/kernel/git/acme/linux

Pull perf tools fixes from Arnaldo Carvalho de Melo:

- Revert "perf map: Fix dso->nsinfo refcounting", this makes 'perf top'
   abort, uncovering a design flaw on how namespace information is kept.
   The fix for that is more than we can do right now, leave it for the
   next merge window.

- Split --dump-raw-trace by AUX records for ARM's CoreSight, fixing up
   the decoding of some records.

- Fix PMU alias matching.

Thanks to James Clark and John Garry for these fixes.

* tag 'perf-tools-fixes-for-v5.14-2021-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
  Revert "perf map: Fix dso->nsinfo refcounting"
  perf pmu: Fix alias matching
  perf cs-etm: Split --dump-raw-trace by AUX records

Merge tag 'powerpc-5.14-4' of git://git./linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

- Don't use r30 in VDSO code, to avoid breaking existing Go lang
   programs.

- Change an export symbol to allow non-GPL modules to use spinlocks
   again.

Thanks to Paul Menzel, and Srikar Dronamraju.

* tag 'powerpc-5.14-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/vdso: Don't use r30 to avoid breaking Go lang
  powerpc/pseries: Fix regression while building external modules

Merge tag 'xfs-5.14-fixes-2' of git://git./fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
"This contains a bunch of bug fixes in XFS.

  Dave and I have been busy the last couple of weeks to find and fix as
  many log recovery bugs as we can find; here are the results so far. Go
  fstests -g recoveryloop! ;)

   - Fix a number of coordination bugs relating to cache flushes for
     metadata writeback, cache flushes for multi-buffer log writes, and
     FUA writes for single-buffer log writes

   - Fix a bug with incorrect replay of attr3 blocks

   - Fix unnecessary stalls when flushing logs to disk

   - Fix spoofing problems when recovering realtime bitmap blocks"

* tag 'xfs-5.14-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: prevent spoofing of rtbitmap blocks when recovering buffers
  xfs: limit iclog tail updates
  xfs: need to see iclog flags in tracing
  xfs: Enforce attr3 buffer recovery order
  xfs: logging the on disk inode LSN can make it go backwards
  xfs: avoid unnecessary waits in xfs_log_force_lsn()
  xfs: log forces imply data device cache flushes
  xfs: factor out forced iclog flushes
  xfs: fix ordering violation between cache flushes and tail updates
  xfs: fold __xlog_state_release_iclog into xlog_state_release_iclog
  xfs: external logs need to flush data device
  xfs: flush data dev on external log write

Merge tag '5.14-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
"Three cifs/smb3 fixes, including two for stable, and a fix for an
  fallocate problem noticed by Clang"

* tag '5.14-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: add missing parsing of backupuid
  smb3: rc uninitialized in one fallocate path
  SMB3: fix readpage for large swap cache

Merge tag 'net-5.14-rc4' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.14-rc4, including fixes from bpf, can, WiFi
  (mac80211) and netfilter trees.

  Current release - regressions:

   - mac80211: fix starting aggregation sessions on mesh interfaces

  Current release - new code bugs:

   - sctp: send pmtu probe only if packet loss in Search Complete state

   - bnxt_en: add missing periodic PHC overflow check

   - devlink: fix phys_port_name of virtual port and merge error

   - hns3: change the method of obtaining default ptp cycle

   - can: mcba_usb_start(): add missing urb->transfer_dma initialization

  Previous releases - regressions:

   - set true network header for ECN decapsulation

   - mlx5e: RX, avoid possible data corruption w/ relaxed ordering and
     LRO

   - phy: re-add check for PHY_BRCM_DIS_TXCRXC_NOENRGY on the BCM54811
     PHY

   - sctp: fix return value check in __sctp_rcv_asconf_lookup

  Previous releases - always broken:

   - bpf:
       - more spectre corner case fixes, introduce a BPF nospec
         instruction for mitigating Spectre v4
       - fix OOB read when printing XDP link fdinfo
       - sockmap: fix cleanup related races

   - mac80211: fix enabling 4-address mode on a sta vif after assoc

   - can:
       - raw: raw_setsockopt(): fix raw_rcv panic for sock UAF
       - j1939: j1939_session_deactivate(): clarify lifetime of session
         object, avoid UAF
       - fix number of identical memory leaks in USB drivers

   - tipc:
       - do not blindly write skb_shinfo frags when doing decryption
       - fix sleeping in tipc accept routine"

* tag 'net-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits)
  gve: Update MAINTAINERS list
  can: esd_usb2: fix memory leak
  can: ems_usb: fix memory leak
  can: usb_8dev: fix memory leak
  can: mcba_usb_start(): add missing urb->transfer_dma initialization
  can: hi311x: fix a signedness bug in hi3110_cmd()
  MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver
  bpf: Fix leakage due to insufficient speculative store bypass mitigation
  bpf: Introduce BPF nospec instruction for mitigating Spectre v4
  sis900: Fix missing pci_disable_device() in probe and remove
  net: let flow have same hash in two directions
  nfc: nfcsim: fix use after free during module unload
  tulip: windbond-840: Fix missing pci_disable_device() in probe and remove
  sctp: fix return value check in __sctp_rcv_asconf_lookup
  nfc: s3fwrn5: fix undefined parameter values in dev_err()
  net/mlx5: Fix mlx5_vport_tbl_attr chain from u16 to u32
  net/mlx5e: Fix nullptr in mlx5e_hairpin_get_mdev()
  net/mlx5: Unload device upon firmware fatal error
  net/mlx5e: Fix page allocation failure for ptp-RQ over SF
  net/mlx5e: Fix page allocation failure for trap-RQ over SF
  ...

Merge tag 'acpi-5.14-rc4' of git://git./linux/kernel/git/rafael/linux-pm

Pull ACPI fixes from Rafael Wysocki:
"These revert a recent IRQ resources handling modification that turned
  out to be problematic, fix suspend-to-idle handling on AMD platforms
  to take upcoming systems into account properly and fix the retrieval
  of the DPTF attributes of the PCH FIVR.

  Specifics:

   - Revert recent change of the ACPI IRQ resources handling that
     attempted to improve the ACPI IRQ override selection logic, but
     introduced serious regressions on some systems (Hui Wang).

   - Fix up quirks for AMD platforms in the suspend-to-idle support code
     so as to take upcoming systems using uPEP HID AMDI007 into account
     as appropriate (Mario Limonciello).

   - Fix the code retrieving DPTF attributes of the PCH FIVR so that it
     agrees on the return data type with the ACPI control method
     evaluated for this purpose (Srinivas Pandruvada)"

* tag 'acpi-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: DPTF: Fix reading of attributes
  Revert "ACPI: resources: Add checks for ACPI IRQ override"
  ACPI: PM: Add support for upcoming AMD uPEP HID AMDI007

pipe: make pipe writes always wake up readers

Since commit 1b6b26ae7053 ("pipe: fix and clarify pipe write wakeup
logic") we have sanitized the pipe write logic, and would only try to
wake up readers if they needed it.

In particular, if the pipe already had data in it before the write,
there was no point in trying to wake up a reader, since any existing
readers must have been aware of the pre-existing data already.  Doing
extraneous wakeups will only cause potential thundering herd problems.

However, it turns out that some Android libraries have misused the EPOLL
interface, and expected "edge triggered" be to "any new write will
trigger it".  Even if there was no edge in sight.

Quoting Sandeep Patil:
"The commit 1b6b26ae7053 ('pipe: fix and clarify pipe write wakeup
  logic') changed pipe write logic to wakeup readers only if the pipe
  was empty at the time of write. However, there are libraries that
  relied upon the older behavior for notification scheme similar to
  what's described in [1]

  One such library 'realm-core'[2] is used by numerous Android
  applications. The library uses a similar notification mechanism as GNU
  Make but it never drains the pipe until it is full. When Android moved
  to v5.10 kernel, all applications using this library stopped working.

  The library has since been fixed[3] but it will be a while before all
  applications incorporate the updated library"

Our regression rule for the kernel is that if applications break from
new behavior, it's a regression, even if it was because the application
did something patently wrong.  Also note the original report [4] by
Michal Kerrisk about a test for this epoll behavior - but at that point
we didn't know of any actual broken use case.

So add the extraneous wakeup, to approximate the old behavior.

[ I say "approximate", because the exact old behavior was to do a wakeup
  not for each write(), but for each pipe buffer chunk that was filled
  in. The behavior introduced by this change is not that - this is just
  "every write will cause a wakeup, whether necessary or not", which
  seems to be sufficient for the broken library use. ]

It's worth noting that this adds the extraneous wakeup only for the
write side, while the read side still considers the "edge" to be purely
about reading enough from the pipe to allow further writes.

See commit f467a6a66419 ("pipe: fix and clarify pipe read wakeup logic")
for the pipe read case, which remains that "only wake up if the pipe was
full, and we read something from it".

Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0bam6w@mail.gmail.com/
Link: https://github.com/realm/realm-core
Link: https://github.com/realm/realm-core/issues/4666
Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXxrH+Q@mail.gmail.com/
Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/
Reported-by: Sandeep Patil <sspatil@android.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Revert "perf map: Fix dso->nsinfo refcounting"

This makes 'perf top' abort in some cases, and the right fix will
involve surgery that is too much to do at this stage, so revert for now
and fix it in the next merge window.

This reverts commit 2d6b74baa7147251c30a46c4996e8cc224aa2dc5.

Cc: Riccardo Mancini <rickyman7@gmail.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Krister Johansen <kjlx@templeofstupid.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

Merge branches 'acpi-resources' and 'acpi-dptf'

* acpi-resources:
Revert "ACPI: resources: Add checks for ACPI IRQ override"

* acpi-dptf:
ACPI: DPTF: Fix reading of attributes

Merge tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- gendisk freeing fix (Christoph)

- blk-iocost wake ordering fix (Tejun)

- tag allocation error handling fix (John)

- loop locking fix. While this isn't the prettiest fix in the world,
   nobody has any good alternatives for 5.14. Something to likely
   revisit for 5.15. (Tetsuo)

* tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
  block: delay freeing the gendisk
  blk-iocost: fix operation ordering in iocg_wake_fn()
  blk-mq-sched: Fix blk_mq_sched_alloc_tags() error handling
  loop: reintroduce global lock for safe loop_validate_file() traversal

Merge tag 'io_uring-5.14-2021-07-30' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

- A fix for block backed reissue (me)

- Reissue context hardening (me)

- Async link locking fix (Pavel)

* tag 'io_uring-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
  io_uring: fix poll requests leaking second poll entries
  io_uring: don't block level reissue off completion path
  io_uring: always reissue from task_work context
  io_uring: fix race in unified task_work running
  io_uring: fix io_prep_async_link locking

Merge tag 'libata-5.14-2021-07-30' of git://git.kernel.dk/linux-block

Pull libata fixlets from Jens Axboe:

- A fix for PIO highmem (Christoph)

- Kill HAVE_IDE as it's now unused (Lukas)

* tag 'libata-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
arch: Kconfig: clean up obsolete use of HAVE_IDE
libata: fix ata_pio_sector for CONFIG_HIGHMEM

Merge tag 'for-5.14-rc3-tag' of git://git./linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

- fix -Warray-bounds warning, to help external patchset to make it
   default treewide

- fix writeable device accounting (syzbot report)

- fix fsync and log replay after a rename and inode eviction

- fix potentially lost error code when submitting multiple bios for
   compressed range

* tag 'for-5.14-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: calculate number of eb pages properly in csum_tree_block
  btrfs: fix rw device counting in __btrfs_free_extra_devids
  btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction
  btrfs: mark compressed range uptodate only if all bio succeed

Merge branch 'for-linus' of git://git./linux/kernel/git/hid/hid

Pull HID fixes from Jiri Kosina:

- resume timing fix for intel-ish driver (Ye Xiang)

- fix for using incorrect MMIO register in amd_sfh driver (Dylan
   MacKenzie)

- Cintiq 24HDT / 27QHDT regression fix and touch processing fix for
   Wacom driver (Jason Gerecke)

- device removal bugfix for ft260 driver (Michael Zaidman)

- other small assorted fixes

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
  HID: ft260: fix device removal due to USB disconnect
  HID: wacom: Skip processing of touches with negative slot values
  HID: wacom: Re-enable touch by default for Cintiq 24HDT / 27QHDT
  HID: Kconfig: Fix spelling mistake "Uninterruptable" -> "Uninterruptible"
  HID: apple: Add support for Keychron K1 wireless keyboard
  HID: fix typo in Kconfig
  HID: ft260: fix format type warning in ft260_word_show()
  HID: amd_sfh: Use correct MMIO register for DMA address
  HID: asus: Remove check for same LED brightness on set
  HID: intel-ish-hid: use async resume function

Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
"7 patches.

  Subsystems affected by this patch series: lib, ocfs2, and mm (slub,
  migration, and memcg)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook()
  slub: fix unreclaimable slab stat for bulk free
  mm/migrate: fix NR_ISOLATED corruption on 64-bit
  mm: memcontrol: fix blocking rstat function called from atomic cgroup1 thresholding code
  ocfs2: issue zeroout to EOF blocks
  ocfs2: fix zero out valid data
  lib/test_string.c: move string selftest in the Runtime Testing menu

Merge tag 'linux-can-fixes-for-5.14-20210730' of git://git./linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2021-07-30

The first patch is by me and adds Yasushi SHOJI as a reviewer for the
Microchip CAN BUS Analyzer Tool driver.

Dan Carpenter's patch fixes a signedness bug in the hi311x driver.

Pavel Skripkin provides 4 patches, the first targets the mcba_usb
driver by adding the missing urb->transfer_dma initialization, which
was broken in a previous commit. The last 3 patches fix a memory leak
in the usb_8dev, ems_usb and esd_usb2 driver.

* tag 'linux-can-fixes-for-5.14-20210730' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
  can: esd_usb2: fix memory leak
  can: ems_usb: fix memory leak
  can: usb_8dev: fix memory leak
  can: mcba_usb_start(): add missing urb->transfer_dma initialization
  can: hi311x: fix a signedness bug in hi3110_cmd()
  MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver
====================

Link: https://lore.kernel.org/r/20210730070526.1699867-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook()

When I use kfree_rcu() to free a large memory allocated by kmalloc_node(),
the following dump occurs.

  BUG: kernel NULL pointer dereference, address: 0000000000000020
  [...]
  Oops: 0000 [#1] SMP
  [...]
  Workqueue: events kfree_rcu_work
  RIP: 0010:__obj_to_index include/linux/slub_def.h:182 [inline]
  RIP: 0010:obj_to_index include/linux/slub_def.h:191 [inline]
  RIP: 0010:memcg_slab_free_hook+0x120/0x260 mm/slab.h:363
  [...]
  Call Trace:
    kmem_cache_free_bulk+0x58/0x630 mm/slub.c:3293
    kfree_bulk include/linux/slab.h:413 [inline]
    kfree_rcu_work+0x1ab/0x200 kernel/rcu/tree.c:3300
    process_one_work+0x207/0x530 kernel/workqueue.c:2276
    worker_thread+0x320/0x610 kernel/workqueue.c:2422
    kthread+0x13d/0x160 kernel/kthread.c:313
    ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294

When kmalloc_node() a large memory, page is allocated, not slab, so when
freeing memory via kfree_rcu(), this large memory should not be used by
memcg_slab_free_hook(), because memcg_slab_free_hook() is is used for
slab.

Using page_objcgs_check() instead of page_objcgs() in
memcg_slab_free_hook() to fix this bug.

Link: https://lkml.kernel.org/r/20210728145655.274476-1-wanghai38@huawei.com
Fixes: 270c6a71460e ("mm: memcontrol/slab: Use helpers to access slab page's memcg_data")
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

slub: fix unreclaimable slab stat for bulk free

SLUB uses page allocator for higher order allocations and update
unreclaimable slab stat for such allocations.  At the moment, the bulk
free for SLUB does not share code with normal free code path for these
type of allocations and have missed the stat update.  So, fix the stat
update by common code.  The user visible impact of the bug is the
potential of inconsistent unreclaimable slab stat visible through
meminfo and vmstat.

Link: https://lkml.kernel.org/r/20210728155354.3440560-1-shakeelb@google.com
Fixes: 6a486c0ad4dc ("mm, sl[ou]b: improve memory accounting")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/migrate: fix NR_ISOLATED corruption on 64-bit

Similar to commit 2da9f6305f30 ("mm/vmscan: fix NR_ISOLATED_FILE
corruption on 64-bit") avoid using unsigned int for nr_pages. With
unsigned int type the large unsigned int converts to a large positive
signed long.

Symptoms include CMA allocations hanging forever due to
alloc_contig_range->...->isolate_migratepages_block waiting forever in
"while (unlikely(too_many_isolated(pgdat)))".

Link: https://lkml.kernel.org/r/20210728042531.359409-1-aneesh.kumar@linux.ibm.com
Fixes: c5fc5c3ae0c8 ("mm: migrate: account THP NUMA migration counters correctly")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: memcontrol: fix blocking rstat function called from atomic cgroup1 thresholding code

Dan Carpenter reports:

    The patch 2d146aa3aa84: "mm: memcontrol: switch to rstat" from Apr
    29, 2021, leads to the following static checker warning:

    kernel/cgroup/rstat.c:200 cgroup_rstat_flush()
    warn: sleeping in atomic context

    mm/memcontrol.c
      3572  static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
      3573  {
      3574          unsigned long val;
      3575
      3576          if (mem_cgroup_is_root(memcg)) {
      3577                  cgroup_rstat_flush(memcg->css.cgroup);
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    This is from static analysis and potentially a false positive.  The
    problem is that mem_cgroup_usage() is called from __mem_cgroup_threshold()
    which holds an rcu_read_lock().  And the cgroup_rstat_flush() function
    can sleep.

      3578                  val = memcg_page_state(memcg, NR_FILE_PAGES) +
      3579                          memcg_page_state(memcg, NR_ANON_MAPPED);
      3580                  if (swap)
      3581                          val += memcg_page_state(memcg, MEMCG_SWAP);
      3582          } else {
      3583                  if (!swap)
      3584                          val = page_counter_read(&memcg->memory);
      3585                  else
      3586                          val = page_counter_read(&memcg->memsw);
      3587          }
      3588          return val;
      3589  }

__mem_cgroup_threshold() indeed holds the rcu lock.  In addition, the
thresholding code is invoked during stat changes, and those contexts
have irqs disabled as well.  If the lock breaking occurs inside the
flush function, it will result in a sleep from an atomic context.

Use the irqsafe flushing variant in mem_cgroup_usage() to fix this.

Link: https://lkml.kernel.org/r/20210726150019.251820-1-hannes@cmpxchg.org
Fixes: 2d146aa3aa84 ("mm: memcontrol: switch to rstat")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ocfs2: issue zeroout to EOF blocks

For punch holes in EOF blocks, fallocate used buffer write to zero the
EOF blocks in last cluster.  But since ->writepage will ignore EOF
pages, those zeros will not be flushed.

This "looks" ok as commit 6bba4471f0cc ("ocfs2: fix data corruption by
fallocate") will zero the EOF blocks when extend the file size, but it
isn't.  The problem happened on those EOF pages, before writeback, those
pages had DIRTY flag set and all buffer_head in them also had DIRTY flag
set, when writeback run by write_cache_pages(), DIRTY flag on the page
was cleared, but DIRTY flag on the buffer_head not.

When next write happened to those EOF pages, since buffer_head already
had DIRTY flag set, it would not mark page DIRTY again.  That made
writeback ignore them forever.  That will cause data corruption.  Even
directio write can't work because it will fail when trying to drop pages
caches before direct io, as it found the buffer_head for those pages
still had DIRTY flag set, then it will fall back to buffer io mode.

To make a summary of the issue, as writeback ingores EOF pages, once any
EOF page is generated, any write to it will only go to the page cache,
it will never be flushed to disk even file size extends and that page is
not EOF page any more.  The fix is to avoid zero EOF blocks with buffer
write.

The following code snippet from qemu-img could trigger the corruption.

  656   open("6b3711ae-3306-4bdd-823c-cf1c0060a095.conv.2", O_RDWR|O_DIRECT|O_CLOEXEC) = 11
  ...
  660   fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2275868672, 327680 <unfinished ...>
  660   fallocate(11, 0, 2275868672, 327680) = 0
  658   pwrite64(11, "

Link: https://lkml.kernel.org/r/20210722054923.24389-2-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ocfs2: fix zero out valid data

If append-dio feature is enabled, direct-io write and fallocate could
run in parallel to extend file size, fallocate used "orig_isize" to
record i_size before taking "ip_alloc_sem", when
ocfs2_zeroout_partial_cluster() zeroout EOF blocks, i_size maybe already
extended by ocfs2_dio_end_io_write(), that will cause valid data zeroed
out.

Link: https://lkml.kernel.org/r/20210722054923.24389-1-junxiao.bi@oracle.com
Fixes: 6bba4471f0cc ("ocfs2: fix data corruption by fallocate")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

lib/test_string.c: move string selftest in the Runtime Testing menu

STRING_SELFTEST is presented in the "Library routines" menu.  Move it in
Kernel hacking > Kernel Testing and Coverage > Runtime Testing together
with other similar tests found in lib/

--- Runtime Testing
<*>   Test functions located in the hexdump module at runtime
<*>   Test string functions (NEW)
<*>   Test functions located in the string_helpers module at runtime
<*>   Test strscpy*() family of functions at runtime
<*>   Test kstrto*() family of functions at runtime
<*>   Test printf() family of functions at runtime
<*>   Test scanf() family of functions at runtime

Link: https://lkml.kernel.org/r/20210719185158.190371-1-mcroce@linux.microsoft.com
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Cc: Peter Rosin <peda@axentia.se>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

gve: Update MAINTAINERS list

The team maintaining the gve driver has undergone some changes,
this updates the MAINTAINERS file accordingly.

Signed-off-by: Catherine Sullivan <csully@google.com>
Signed-off-by: Jon Olson <jonolson@google.com>
Signed-off-by: David Awogbemila <awogbemila@google.com>
Signed-off-by: Jeroen de Borst <jeroendb@google.com>
Link: https://lore.kernel.org/r/20210729155258.442650-1-csully@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

arch: Kconfig: clean up obsolete use of HAVE_IDE

The arch-specific Kconfig files use HAVE_IDE to indicate if IDE is
supported.

As IDE support and the HAVE_IDE config vanishes with commit b7fb14d3ac63
("ide: remove the legacy ide driver"), there is no need to mention
HAVE_IDE in all those arch-specific Kconfig files.

The issue was identified with ./scripts/checkkconfigsymbols.py.

Fixes: b7fb14d3ac63 ("ide: remove the legacy ide driver")
Suggested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20210728182115.4401-1-lukas.bulwahn@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

can: esd_usb2: fix memory leak

In esd_usb2_setup_rx_urbs() MAX_RX_URBS coherent buffers are allocated
and there is nothing, that frees them:

1) In callback function the urb is resubmitted and that's all
2) In disconnect function urbs are simply killed, but URB_FREE_BUFFER
is not set (see esd_usb2_setup_rx_urbs) and this flag cannot be used
with coherent buffers.

So, all allocated buffers should be freed with usb_free_coherent()
explicitly.

Side note: This code looks like a copy-paste of other can drivers. The
same patch was applied to mcba_usb driver and it works nice with real
hardware. There is no change in functionality, only clean-up code for
coherent buffers.

Fixes: 96d8e90382dc ("can: Add driver for esd CAN-USB/2 device")
Link: https://lore.kernel.org/r/b31b096926dcb35998ad0271aac4b51770ca7cc8.1627404470.git.paskripkin@gmail.com
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: ems_usb: fix memory leak

In ems_usb_start() MAX_RX_URBS coherent buffers are allocated and
there is nothing, that frees them:

1) In callback function the urb is resubmitted and that's all
2) In disconnect function urbs are simply killed, but URB_FREE_BUFFER
is not set (see ems_usb_start) and this flag cannot be used with
coherent buffers.

So, all allocated buffers should be freed with usb_free_coherent()
explicitly.

Side note: This code looks like a copy-paste of other can drivers. The
same patch was applied to mcba_usb driver and it works nice with real
hardware. There is no change in functionality, only clean-up code for
coherent buffers.

Fixes: 702171adeed3 ("ems_usb: Added support for EMS CPC-USB/ARM7 CAN/USB interface")
Link: https://lore.kernel.org/r/59aa9fbc9a8cbf9af2bbd2f61a659c480b415800.1627404470.git.paskripkin@gmail.com
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: usb_8dev: fix memory leak

In usb_8dev_start() MAX_RX_URBS coherent buffers are allocated and
there is nothing, that frees them:

1) In callback function the urb is resubmitted and that's all
2) In disconnect function urbs are simply killed, but URB_FREE_BUFFER
is not set (see usb_8dev_start) and this flag cannot be used with
coherent buffers.

So, all allocated buffers should be freed with usb_free_coherent()
explicitly.

Side note: This code looks like a copy-paste of other can drivers. The
same patch was applied to mcba_usb driver and it works nice with real
hardware. There is no change in functionality, only clean-up code for
coherent buffers.

Fixes: 0024d8ad1639 ("can: usb_8dev: Add support for USB2CAN interface from 8 devices")
Link: https://lore.kernel.org/r/d39b458cd425a1cf7f512f340224e6e9563b07bd.1627404470.git.paskripkin@gmail.com
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: mcba_usb_start(): add missing urb->transfer_dma initialization

Yasushi reported, that his Microchip CAN Analyzer stopped working
since commit 91c02557174b ("can: mcba_usb: fix memory leak in
mcba_usb"). The problem was in missing urb->transfer_dma
initialization.

In my previous patch to this driver I refactored mcba_usb_start() code
to avoid leaking usb coherent buffers. To archive it, I passed local
stack variable to usb_alloc_coherent() and then saved it to private
array to correctly free all coherent buffers on ->close() call. But I
forgot to initialize urb->transfer_dma with variable passed to
usb_alloc_coherent().

All of this was causing device to not work, since dma addr 0 is not
valid and following log can be found on bug report page, which points
exactly to problem described above.

| DMAR: [DMA Write] Request device [00:14.0] PASID ffffffff fault addr 0 [fault reason 05] PTE Write access is not set

Fixes: 91c02557174b ("can: mcba_usb: fix memory leak in mcba_usb")
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=990850
Link: https://lore.kernel.org/r/20210725103630.23864-1-paskripkin@gmail.com
Cc: linux-stable <stable@vger.kernel.org>
Reported-by: Yasushi SHOJI <yasushi.shoji@gmail.com>
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Tested-by: Yasushi SHOJI <yashi@spacecubics.com>
[mkl: fixed typos in commit message - thanks Yasushi SHOJI]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: hi311x: fix a signedness bug in hi3110_cmd()

The hi3110_cmd() is supposed to return zero on success and negative
error codes on failure, but it was accidentally declared as a u8 when
it needs to be an int type.

Fixes: 57e83fb9b746 ("can: hi311x: Add Holt HI-311x CAN driver")
Link: https://lore.kernel.org/r/20210729141246.GA1267@kili
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver

This patch adds Yasushi SHOJI as a reviewer for the Microchip CAN BUS
Analyzer Tool driver.

Link: https://lore.kernel.org/r/20210726111619.1023991-1-mkl@pengutronix.de
Acked-by: Yasushi SHOJI <yashi@spacecubics.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

Merge tag 'drm-fixes-2021-07-30' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Dave Airlie:
"Regular drm fixes pull, seems about the right size, lots of small
  fixes across the board, mostly amdgpu, but msm and i915 are in there
  along with panel and ttm.

  amdgpu:
   - Fix resource leak in an error path
   - Avoid stack contents exposure in error path
   - pmops check fix for S0ix vs S3
   - DCN 2.1 display fixes
   - DCN 2.0 display fix
   - Backlight control fix for laptops with HDR panels
   - Maintainers updates

  i915:
   - Fix vbt port mask
   - Fix around reading the right DSC disable fuse in display_ver 10
   - Split display version 9 and 10 in intel_setup_outputs

  msm:
   - iommu fault display fix
   - misc dp compliance fixes
   - dpu reg sizing fix

  panel:
   - Fix bpc for ytc700tlag_05_201c

  ttm:
   - debugfs init fixes"

* tag 'drm-fixes-2021-07-30' of git://anongit.freedesktop.org/drm/drm:
  maintainers: add bugs and chat URLs for amdgpu
  drm/amdgpu/display: only enable aux backlight control for OLED panels
  drm/amd/display: ensure dentist display clock update finished in DCN20
  drm/amd/display: Add missing DCN21 IP parameter
  drm/amd/display: Guard DST_Y_PREFETCH register overflow in DCN21
  drm/amdgpu: Check pmops for desired suspend state
  drm/msm/dp: Initialize dp->aux->drm_dev before registration
  drm/msm/dp: signal audio plugged change at dp_pm_resume
  drm/msm/dp: Initialize the INTF_CONFIG register
  drm/msm/dp: use dp_ctrl_off_link_stream during PHY compliance test run
  drm/msm: Fix display fault handling
  drm/msm/dpu: Fix sm8250_mdp register length
  drm/amdgpu: Avoid printing of stack contents on firmware load error
  drm/amdgpu: Fix resource leak on probe error path
  drm/i915/display: split DISPLAY_VER 9 and 10 in intel_setup_outputs()
  drm/i915: fix not reading DSC disable fuse in GLK
  drm/i915/bios: Fix ports mask
  drm/panel: panel-simple: Fix proper bpc for ytc700tlag_05_201c
  drm/ttm: Initialize debugfs from ttm_global_init()

Merge tag 'fallthrough-fixes-clang-5.14-rc4' of git://git./linux/kernel/git/gustavoars/linux

Pull fallthrough fixes from Gustavo Silva:
"Fix some fall-through warnings when building with Clang and
  '-Wimplicit-fallthrough' on ARM"

* tag 'fallthrough-fixes-clang-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
  scsi: fas216: Fix fall-through warning for Clang
  scsi: acornscsi: Fix fall-through warning for clang
  ARM: riscpc: Fix fall-through warning for Clang

Merge branch 'for-linus' of git://git./linux/kernel/git/mattst88/alpha

Pull alpha updates from Matt Turner:
"They're mostly small janitorial fixes but there's also more important
  ones:

   - drop the alpha-specific x86 binary loader (David Hildenbrand)

   - regression fix for at least Marvel platforms (Mike Rapoport)

   - fix for a scary-looking typo (Zheng Yongjun)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mattst88/alpha:
  alpha: register early reserved memory in memblock
  alpha: fix spelling mistakes
  alpha: Remove space between * and parameter name
  alpha: fp_emul: avoid init/cleanup_module names
  alpha: Add syscall_get_return_value()
  binfmt: remove support for em86 (alpha only)
  alpha: fix typos in a comment
  alpha: defconfig: add necessary configs for boot testing
  alpha: Send stop IPI to send to online CPUs
  alpha: convert comma to semicolon
  alpha: remove undef inline in compiler.h
  alpha: Kconfig: Replace HTTP links with HTTPS ones
  alpha: __udiv_qrnnd should be exported

scsi: fas216: Fix fall-through warning for Clang

Fix the following fallthrough warning (on ARM):

drivers/scsi/arm/fas216.c:1379:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
           default:
           ^
   drivers/scsi/arm/fas216.c:1379:2: note: insert 'break;' to avoid fall-through
           default:
           ^
           break;

Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/202107260355.bF00i5bi-lkp@intel.com/
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>

scsi: acornscsi: Fix fall-through warning for clang

Fix the following fallthrough warning (on ARM):

drivers/scsi/arm/acornscsi.c:2651:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
           case res_success:
           ^
   drivers/scsi/arm/acornscsi.c:2651:2: note: insert '__attribute__((fallthrough));' to silence this warning
           case res_success:
           ^
           __attribute__((fallthrough));
   drivers/scsi/arm/acornscsi.c:2651:2: note: insert 'break;' to avoid fall-through
           case res_success:
           ^
           break;
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/202107260355.bF00i5bi-lkp@intel.com/
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>

ARM: riscpc: Fix fall-through warning for Clang

Fix the following fallthrough warning:

arch/arm/mach-rpc/riscpc.c:52:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
           default:
           ^
arch/arm/mach-rpc/riscpc.c:52:2: note: insert 'break;' to avoid fall-through
           default:
           ^
           break;

Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/202107260355.bF00i5bi-lkp@intel.com/
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>

Merge tag 'for-linus' of git://git./virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"ARM:

   - Fix MTE shared page detection

   - Enable selftest's use of PMU registers when asked to

  s390:

   - restore 5.13 debugfs names

  x86:

   - fix sizes for vcpu-id indexed arrays

   - fixes for AMD virtualized LAPIC (AVIC)

   - other small bugfixes

  Generic:

   - access tracking performance test

   - dirty_log_perf_test command line parsing fix

   - Fix selftest use of obsolete pthread_yield() in favour of
     sched_yield()

   - use cpu_relax when halt polling

   - fixed missing KVM_CLEAR_DIRTY_LOG compat ioctl"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: add missing compat KVM_CLEAR_DIRTY_LOG
  KVM: use cpu_relax when halt polling
  KVM: SVM: use vmcb01 in svm_refresh_apicv_exec_ctrl
  KVM: SVM: tweak warning about enabled AVIC on nested entry
  KVM: SVM: svm_set_vintr don't warn if AVIC is active but is about to be deactivated
  KVM: s390: restore old debugfs names
  KVM: SVM: delay svm_vcpu_init_msrpm after svm->vmcb is initialized
  KVM: selftests: Introduce access_tracking_perf_test
  KVM: selftests: Fix missing break in dirty_log_perf_test arg parsing
  x86/kvm: fix vcpu-id indexed array sizes
  KVM: x86: Check the right feature bit for MSR_KVM_ASYNC_PF_ACK access
  docs: virt: kvm: api.rst: replace some characters
  KVM: Documentation: Fix KVM_CAP_ENFORCE_PV_FEATURE_CPUID name
  KVM: nSVM: Swap the parameter order for svm_copy_vmrun_state()/svm_copy_vmloadsave_state()
  KVM: nSVM: Rename nested_svm_vmloadsave() to svm_copy_vmloadsave_state()
  KVM: arm64: selftests: get-reg-list: actually enable pmu regs in pmu sublist
  KVM: selftests: change pthread_yield to sched_yield
  KVM: arm64: Fix detection of shared VMAs on guest fault

Merge branch 'for-linus' of git://git./linux/kernel/git/gerg/m68knommu

Pull m68knommu fix from Greg Ungerer:
"A single compile time fix"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
m68k/coldfire: change pll var. to clk_pll

xfs: prevent spoofing of rtbitmap blocks when recovering buffers

While reviewing the buffer item recovery code, the thought occurred to
me: in V5 filesystems we use log sequence number (LSN) tracking to avoid
replaying older metadata updates against newer log items. However, we
use the magic number of the ondisk buffer to find the LSN of the ondisk
metadata, which means that if an attacker can control the layout of the
realtime device precisely enough that the start of an rt bitmap block
matches the magic and UUID of some other kind of block, they can control
the purported LSN of that spoofed block and thereby break log replay.

Since realtime bitmap and summary blocks don't have headers at all, we
have no way to tell if a block really should be replayed. The best we
can do is replay unconditionally and hope for the best.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

xfs: limit iclog tail updates

From the department of "generic/482 keeps on giving", we bring you
another tail update race condition:

iclog:
S1 C1
+-----------------------+-----------------------+
S2 EOIC

Two checkpoints in a single iclog. One is complete, the other just
contains the start record and overruns into a new iclog.

Timeline:

Before S1: Cache flush, log tail = X
At S1: Metadata stable, write start record and checkpoint
At C1: Write commit record, set NEED_FUA
Single iclog checkpoint, so no need for NEED_FLUSH
Log tail still = X, so no need for NEED_FLUSH

After C1,
Before S2: Cache flush, log tail = X
At S2: Metadata stable, write start record and checkpoint
After S2: Log tail moves to X+1
At EOIC: End of iclog, more journal data to write
Releases iclog
Not a commit iclog, so no need for NEED_FLUSH
Writes log tail X+1 into iclog.

At this point, the iclog has tail X+1 and NEED_FUA set. There has
been no cache flush for the metadata between X and X+1, and the
iclog writes the new tail permanently to the log. THis is sufficient
to violate on disk metadata/journal ordering.

We have two options here. The first is to detect this case in some
manner and ensure that the partial checkpoint write sets NEED_FLUSH
when the iclog is already marked NEED_FUA and the log tail changes.
This seems somewhat fragile and quite complex to get right, and it
doesn't actually make it obvious what underlying problem it is
actually addressing from reading the code.

The second option seems much cleaner to me, because it is derived
directly from the requirements of the C1 commit record in the iclog.
That is, when we write this commit record to the iclog, we've
guaranteed that the metadata/data ordering is correct for tail
update purposes. Hence if we only write the log tail into the iclog
for the *first* commit record rather than the log tail at the last
release, we guarantee that the log tail does not move past where the
the first commit record in the log expects it to be.

IOWs, taking the first option means that replay of C1 becomes
dependent on future operations doing the right thing, not just the
C1 checkpoint itself doing the right thing. This makes log recovery
almost impossible to reason about because now we have to take into
account what might or might not have happened in the future when
looking at checkpoints in the log rather than just having to
reconstruct the past...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: need to see iclog flags in tracing

Because I cannot tell if the NEED_FLUSH flag is being set correctly
by the log force and CIL push machinery without it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: Enforce attr3 buffer recovery order

From the department of "WTAF? How did we miss that!?"...

When we are recovering a buffer, the first thing we do is check the
buffer magic number and extract the LSN from the buffer. If the LSN
is older than the current LSN, we replay the modification to it. If
the metadata on disk is newer than the transaction in the log, we
skip it. This is a fundamental v5 filesystem metadata recovery
behaviour.

generic/482 failed with an attribute writeback failure during log
recovery. The write verifier caught the corruption before it got
written to disk, and the attr buffer dump looked like:

XFS (dm-3): Metadata corruption detected at xfs_attr3_leaf_verify+0x275/0x2e0, xfs_attr3_leaf block 0x19be8
XFS (dm-3): Unmount and run xfs_repair
XFS (dm-3): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 3b ee 00 00 4d 2a 01 e1  ........;...M*..
00000010: 00 00 00 00 00 01 9b e8 00 00 00 01 00 00 05 38  ...............8
                                  ^^^^^^^^^^^^^^^^^^^^^^^
00000020: df 39 5e 51 58 ac 44 b6 8d c5 e7 10 44 09 bc 17  .9^QX.D.....D...
00000030: 00 00 00 00 00 02 00 83 00 03 00 cc 0f 24 01 00  .............$..
00000040: 00 68 0e bc 0f c8 00 10 00 00 00 00 00 00 00 00  .h..............
00000050: 00 00 3c 31 0f 24 01 00 00 00 3c 32 0f 88 01 00  ..<1.$....<2....
00000060: 00 00 3c 33 0f d8 01 00 00 00 00 00 00 00 00 00  ..<3............
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
.....

The highlighted bytes are the LSN that was replayed into the
buffer: 0x100000538. This is cycle 1, block 0x538. Prior to replay,
that block on disk looks like this:

$ sudo xfs_db -c "fsb 0x417d" -c "type attr3" -c p /dev/mapper/thin-vol
hdr.info.hdr.forw = 0
hdr.info.hdr.back = 0
hdr.info.hdr.magic = 0x3bee
hdr.info.crc = 0xb5af0bc6 (correct)
hdr.info.bno = 105448
hdr.info.lsn = 0x100000900
               ^^^^^^^^^^^
hdr.info.uuid = df395e51-58ac-44b6-8dc5-e7104409bc17
hdr.info.owner = 131203
hdr.count = 2
hdr.usedbytes = 120
hdr.firstused = 3796
hdr.holes = 1
hdr.freemap[0-2] = [base,size]

Note the LSN stamped into the buffer on disk: 1/0x900. The version
on disk is much newer than the log transaction that was being
replayed. That's a bug, and should -never- happen.

So I immediately went to look at xlog_recover_get_buf_lsn() to check
that we handled the LSN correctly. I was wondering if there was a
similar "two commits with the same start LSN skips the second
replay" problem with buffers. I didn't get that far, because I found
a much more basic, rudimentary bug: xlog_recover_get_buf_lsn()
doesn't recognise buffers with XFS_ATTR3_LEAF_MAGIC set in them!!!

IOWs, attr3 leaf buffers fall through the magic number checks
unrecognised, so trigger the "recover immediately" behaviour instead
of undergoing an LSN check. IOWs, we incorrectly replay ATTR3 leaf
buffers and that causes silent on disk corruption of inode attribute
forks and potentially other things....

Git history shows this is *another* zero day bug, this time
introduced in commit 50d5c8d8e938 ("xfs: check LSN ordering for v5
superblocks during recovery") which failed to handle the attr3 leaf
buffers in recovery. And we've failed to handle them ever since...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: logging the on disk inode LSN can make it go backwards

When we log an inode, we format the "log inode" core and set an LSN
in that inode core. We do that via xfs_inode_item_format_core(),
which calls:

xfs_inode_to_log_dinode(ip, dic, ip->i_itemp->ili_item.li_lsn);

to format the log inode. It writes the LSN from the inode item into
the log inode, and if recovery decides the inode item needs to be
replayed, it recovers the log inode LSN field and writes it into the
on disk inode LSN field.

Now this might seem like a reasonable thing to do, but it is wrong
on multiple levels. Firstly, if the item is not yet in the AIL,
item->li_lsn is zero. i.e. the first time the inode it is logged and
formatted, the LSN we write into the log inode will be zero. If we
only log it once, recovery will run and can write this zero LSN into
the inode.

This means that the next time the inode is logged and log recovery
runs, it will *always* replay changes to the inode regardless of
whether the inode is newer on disk than the version in the log and
that violates the entire purpose of recording the LSN in the inode
at writeback time (i.e. to stop it going backwards in time on disk
during recovery).

Secondly, if we commit the CIL to the journal so the inode item
moves to the AIL, and then relog the inode, the LSN that gets
stamped into the log inode will be the LSN of the inode's current
location in the AIL, not it's age on disk. And it's not the LSN that
will be associated with the current change. That means when log
recovery replays this inode item, the LSN that ends up on disk is
the LSN for the previous changes in the log, not the current
changes being replayed. IOWs, after recovery the LSN on disk is not
in sync with the LSN of the modifications that were replayed into
the inode. This, again, violates the recovery ordering semantics
that on-disk writeback LSNs provide.

Hence the inode LSN in the log dinode is -always- invalid.

Thirdly, recovery actually has the LSN of the log transaction it is
replaying right at hand - it uses it to determine if it should
replay the inode by comparing it to the on-disk inode's LSN. But it
doesn't use that LSN to stamp the LSN into the inode which will be
written back when the transaction is fully replayed. It uses the one
in the log dinode, which we know is always going to be incorrect.

Looking back at the change history, the inode logging was broken by
commit 93f958f9c41f ("xfs: cull unnecessary icdinode fields") way
back in 2016 by a stupid idiot who thought he knew how this code
worked. i.e. me. That commit replaced an in memory di_lsn field that
was updated only at inode writeback time from the inode item.li_lsn
value - and hence always contained the same LSN that appeared in the
on-disk inode - with a read of the inode item LSN at inode format
time. CLearly these are not the same thing.

Before 93f958f9c41f, the log recovery behaviour was irrelevant,
because the LSN in the log inode always matched the on-disk LSN at
the time the inode was logged, hence recovery of the transaction
would never make the on-disk LSN in the inode go backwards or get
out of sync.

A symptom of the problem is this, caught from a failure of
generic/482. Before log recovery, the inode has been allocated but
never used:

xfs_db> inode 393388
xfs_db> p
core.magic = 0x494e
core.mode = 0
....
v3.crc = 0x99126961 (correct)
v3.change_count = 0
v3.lsn = 0
v3.flags2 = 0
v3.cowextsize = 0
v3.crtime.sec = Thu Jan 1 10:00:00 1970
v3.crtime.nsec = 0

After log recovery:

xfs_db> p
core.magic = 0x494e
core.mode = 020444
....
v3.crc = 0x23e68f23 (correct)
v3.change_count = 2
v3.lsn = 0
v3.flags2 = 0
v3.cowextsize = 0
v3.crtime.sec = Thu Jul 22 17:03:03 2021
v3.crtime.nsec = 751000000
...

You can see that the LSN of the on-disk inode is 0, even though it
clearly has been written to disk. I point out this inode, because
the generic/482 failure occurred because several adjacent inodes in
this specific inode cluster were not replayed correctly and still
appeared to be zero on disk when all the other metadata (inobt,
finobt, directories, etc) indicated they should be allocated and
written back.

The fix for this is two-fold. The first is that we need to either
revert the LSN changes in 93f958f9c41f or stop logging the inode LSN
altogether. If we do the former, log recovery does not need to
change but we add 8 bytes of memory per inode to store what is
largely a write-only inode field. If we do the latter, log recovery
needs to stamp the on-disk inode in the same manner that inode
writeback does.

I prefer the latter, because we shouldn't really be trying to log
and replay changes to the on disk LSN as the on-disk value is the
canonical source of the on-disk version of the inode. It also
matches the way we recover buffer items - we create a buf_log_item
that carries the current recovery transaction LSN that gets stamped
into the buffer by the write verifier when it gets written back
when the transaction is fully recovered.

However, this might break log recovery on older kernels even more,
so I'm going to simply ignore the logged value in recovery and stamp
the on-disk inode with the LSN of the transaction being recovered
that will trigger writeback on transaction recovery completion. This
will ensure that the on-disk inode LSN always reflects the LSN of
the last change that was written to disk, regardless of whether it
comes from log recovery or runtime writeback.

Fixes: 93f958f9c41f ("xfs: cull unnecessary icdinode fields")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: avoid unnecessary waits in xfs_log_force_lsn()

Before waiting on a iclog in xfs_log_force_lsn(), we don't check to
see if the iclog has already been completed and the contents on
stable storage. We check for completed iclogs in xfs_log_force(), so
we should do the same thing for xfs_log_force_lsn().

This fixed some random up-to-30s pauses seen in unmounting
filesystems in some tests. A log force ends up waiting on completed
iclog, and that doesn't then get flushed (and hence the log force
get completed) until the background log worker issues a log force
that flushes the iclog in question. Then the unmount unblocks and
continues.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: log forces imply data device cache flushes

After fixing the tail_lsn vs cache flush race, generic/482 continued
to fail in a similar way where cache flushes were missing before
iclog FUA writes. Tracing of iclog state changes during the fsstress
workload portion of the test (via xlog_iclog* events) indicated that
iclog writes were coming from two sources - CIL pushes and log
forces (due to fsync/O_SYNC operations). All of the cases where a
recovery problem was triggered indicated that the log force was the
source of the iclog write that was not preceeded by a cache flush.

This was an oversight in the modifications made in commit
eef983ffeae7 ("xfs: journal IO cache flush reductions"). Log forces
for fsync imply a data device cache flush has been issued if an
iclog was flushed to disk and is indicated to the caller via the
log_flushed parameter so they can elide the device cache flush if
the journal issued one.

The change in eef983ffeae7 results in iclogs only issuing a cache
flush if XLOG_ICL_NEED_FLUSH is set on the iclog, but this was not
added to the iclogs that the log force code flushes to disk. Hence
log forces are no longer guaranteeing that a cache flush is issued,
hence opening up a potential on-disk ordering failure.

Log forces should also set XLOG_ICL_NEED_FUA as well to ensure that
the actual iclogs it forces to the journal are also on stable
storage before it returns to the caller.

This patch introduces the xlog_force_iclog() helper function to
encapsulate the process of taking a reference to an iclog, switching
its state if WANT_SYNC and flushing it to stable storage correctly.

Both xfs_log_force() and xfs_log_force_lsn() are converted to use
it, as is xlog_unmount_write() which has an elaborate method of
doing exactly the same "write this iclog to stable storage"
operation.

Further, if the log force code needs to wait on a iclog in the
WANT_SYNC state, it needs to ensure that iclog also results in a
cache flush being issued. This covers the case where the iclog
contains the commit record of the CIL flush that the log force
triggered, but it hasn't been written yet because there is still an
active reference to the iclog.

Note: this whole cache flush whack-a-mole patch is a result of log
forces still being iclog state centric rather than being CIL
sequence centric. Most of this nasty code will go away in future
when log forces are converted to wait on CIL sequence push
completion rather than iclog completion. With the CIL push algorithm
guaranteeing that the CIL checkpoint is fully on stable storage when
it completes, we no longer need to iterate iclogs and push them to
ensure a CIL sequence push has completed and so all this nasty iclog
iteration and flushing code will go away.

Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: factor out forced iclog flushes

We force iclogs in several places - we need them all to have the
same cache flush semantics, so start by factoring out the iclog
force into a common helper.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: fix ordering violation between cache flushes and tail updates

There is a race between the new CIL async data device metadata IO
completion cache flush and the log tail in the iclog the flush
covers being updated. This can be seen by repeating generic/482 in a
loop and eventually log recovery fails with a failures such as this:

XFS (dm-3): Starting recovery (logdev: internal)
XFS (dm-3): bad inode magic/vsn daddr 228352 #0 (magic=0)
XFS (dm-3): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x37c00 xfs_inode_buf_verify
XFS (dm-3): Unmount and run xfs_repair
XFS (dm-3): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
XFS (dm-3): metadata I/O error in "xlog_recover_items_pass2+0x55/0xc0" at daddr 0x37c00 len 32 error 117

Analysis of the logwrite replay shows that there were no writes to
the data device between the FUA @ write 124 and the FUA at write @
125, but log recovery @ 125 failed. The difference was the one log
write @ 125 moved the tail of the log forwards from (1,8) to (1,32)
and so the inode create intent in (1,8) was not replayed and so the
inode cluster was zero on disk when replay of the first inode item
in (1,32) was attempted.

What this meant was that the journal write that occurred at @ 125
did not ensure that metadata completed before the iclog was written
was correctly on stable storage. The tail of the log moved forward,
so IO must have been completed between the two iclog writes. This
means that there is a race condition between the unconditional async
cache flush in the CIL push work and the tail LSN that is written to
the iclog. This happens like so:

CIL push work AIL push work
------------- -------------
Add to committing list
start async data dev cache flush
.....
<flush completes>
<all writes to old tail lsn are stable>
xlog_write
  .... push inode create buffer
<start IO>
.....
xlog_write(commit record)
  .... <IO completes>
   log tail moves
     xlog_assign_tail_lsn()
start_lsn == commit_lsn
  <no iclog preflush!>
xlog_state_release_iclog
  __xlog_state_release_iclog()
    <writes *new* tail_lsn into iclog>
  xlog_sync()
    ....
    submit_bio()
<tail in log moves forward without flushing written metadata>

Essentially, this can only occur if the commit iclog is issued
without a cache flush. If the iclog bio is submitted with
REQ_PREFLUSH, then it will guarantee that all the completed IO is
one stable storage before the iclog bio with the new tail LSN in it
is written to the log.

IOWs, the tail lsn that is written to the iclog needs to be sampled
*before* we issue the cache flush that guarantees all IO up to that
LSN has been completed.

To fix this without giving up the performance advantage of the
flush/FUA optimisations (e.g. g/482 runtime halves with 5.14-rc1
compared to 5.13), we need to ensure that we always issue a cache
flush if the tail LSN changes between the initial async flush and
the commit record being written. THis requires sampling the tail_lsn
before we start the flush, and then passing the sampled tail LSN to
xlog_state_release_iclog() so it can determine if the the tail LSN
has changed while writing the checkpoint. If the tail LSN has
changed, then it needs to set the NEED_FLUSH flag on the iclog and
we'll issue another cache flush before writing the iclog.

Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: fold __xlog_state_release_iclog into xlog_state_release_iclog

Fold __xlog_state_release_iclog into its only caller to prepare
make an upcoming fix easier.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>